Bootstrap issues with 4.9.0 image #12

Closed
saschagrunert opened this issue Dec 22, 2021 · 11 comments · Fixed by #31
Labels
bug Something isn't working

Comments

@saschagrunert
Contributor

saschagrunert commented Dec 22, 2021

Hey, the bootstrap node is no longer bootable. The installation seems to complete fine, and the iPXE process then restarts the machine. The reboot then gets stuck with:

Booting from Hard drive C:
..
error: ../../grub-core/disk/i386/pc/biosdisk.c:498:failure reading sector 0x0
from `cd'.

I tried provisioning multiple facilities (da11, ams6, fra1) without any success.


Sometimes the CoreOS kernel boot screen of GRUB appears instead, but then the out-of-band console turns black. The machine responds to ping, but no service such as SSH is reachable.

@displague displague added the bug Something isn't working label Jan 3, 2022
@displague
Member

displague commented Jan 3, 2022

I ran into different problems with the bootstrap node on reboot, #10 - these were not OS boot related.

What device plan were you using, @saschagrunert ?

@saschagrunert
Contributor Author

@displague do you mean the machine type? I recently tried c3.small as well as c3.medium (Dell R6515), and in both cases the nodes show a black screen followed by a reboot after the CoreOS boot selector screen. The iPXE installation exits with a success indicator.

So I assume it’s something in the kernel boot parameters. 🤔

@orenc1

orenc1 commented Jan 11, 2022

I'm now encountering the very same issue while trying to spin up an OCP 4.9 cluster on Equinix Metal in the DC13 facility.
The out-of-band console shows a black screen for all servers (except the lb), or is stuck at:

$ ssh d186e620-c6ba-44c1-88e2-b6de75031825@sos.dc13.platformequinix.com
[SOS Session Ready. Use ~? for help.]
[Note: You may need to press RETURN or Ctrl+L to get a prompt.]

This is the bootstrap.ipxe configuration that is being used:

#!ipxe

set release 4.9
set zstream 0
set arch x86_64
set coreos-url http://147.28.129.183:8080
set coreos-img ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-rootfs.${arch}.img
set console console=ttyS1,115200n8
 
kernel ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-kernel-${arch} initrd=main coreos.live.rootfs_url=${coreos-img} coreos.inst.install_dev=sda coreos.inst.ignition_url=http://147.28.129.183:8080/bootstrap-append.ign ${console} console=tty0 console=ttyS0,115200n8 ip=dhcp
initrd --name main ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-initramfs.${arch}.img
boot

Please advise,
Thanks

@displague
Member

I see the console is repeated in the kernel args there, on both ttyS0 and ttyS1. That doesn't sound right, but I don't know whether it would cause a problem (other than logs from the "getty").
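
For reference, a minimal Python sketch (not part of this repository) that pulls the `console=` arguments out of the kernel line above. The command line is a hand-expanded copy of the one in bootstrap.ipxe, with `${console}` substituted and the URL-bearing arguments left out for brevity. As far as I know, the kernel sends boot messages to every listed console but uses the last one for `/dev/console`; with this ordering that would be ttyS0 rather than the ttyS1 named in `set console`. Whether that explains the black SOS console has not been verified here.

```python
def console_args(cmdline: str) -> list[str]:
    """Return the value of every console= argument, in order of appearance."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("console=")
    ]


# Hand-expanded from bootstrap.ipxe: ${console} is console=ttyS1,115200n8.
# The coreos.live.rootfs_url and coreos.inst.ignition_url arguments are omitted.
cmdline = (
    "initrd=main coreos.inst.install_dev=sda "
    "console=ttyS1,115200n8 console=tty0 console=ttyS0,115200n8 ip=dhcp"
)

consoles = console_args(cmdline)
serial_consoles = [c for c in consoles if c.startswith("ttyS")]

print("all consoles:   ", consoles)
print("serial consoles:", serial_consoles)
print("last console:   ", consoles[-1])
```

Two serial consoles are listed, and the last entry is the ttyS0 one.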

@displague
Member

displague commented Jan 11, 2022

I'm also getting black screens on the SoS console for the control plane and worker nodes. SSH is not responsive either.

I see that the control plane nodes are configured with an IPXE Script URL of http://{lb-0 address}:8080/master.ipxe - this URL 404s
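
A rough Python sketch for checking which per-role iPXE scripts the load balancer actually serves. The role names (bootstrap, master, worker) are assumed from file names mentioned in this thread, the base URL is the address posted above, and `ipxe_urls` and `http_status` are hypothetical helper names. Only the URL construction runs here; `http_status` makes a real network request and is left for manual use.

```python
import urllib.error
import urllib.request

# Assumption: one <role>.ipxe script per role, served from the web root.
ROLES = ("bootstrap", "master", "worker")


def ipxe_urls(base_url: str) -> dict[str, str]:
    """Map each role to the URL its iPXE script is expected at."""
    base = base_url.rstrip("/")
    return {role: f"{base}/{role}.ipxe" for role in ROLES}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


urls = ipxe_urls("http://147.28.129.183:8080")
for role, url in urls.items():
    print(role, url)
```

Calling `http_status(urls["master"])` against a live load balancer would then return 404 in the situation described above and 200 once the file is in place.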

@displague
Member

displague commented Jan 11, 2022

The ipxe scripts are not found on the bastion node (/usr/share/nginx/html/*.ipxe)

@displague
Member

displague commented Jan 11, 2022

From an empty state, I applied the following targets individually:

terraform apply -target 'module.bastion.null_resource.ignition_append_files["master"]'
terraform apply -target 'module.bastion.null_resource.ipxe_files'
terraform apply -target 'module.bastion.null_resource.ocp_install_ignition'

With this approach, /usr/share/nginx/html/ contained the files that were not present in the previous pass. I didn't target all of the files that are supposed to be in this directory.

I then ran a full terraform apply and observed the following warnings or errors:

module.prepare_openshift.null_resource.ocp_installer: Creating...
module.prepare_openshift.null_resource.ocp_installer: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_installer: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_installer (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_installer (remote-exec): gzip: stdin: not in gzip format
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Child returned status 1
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Error is not recoverable: exiting now

module.prepare_openshift.null_resource.ocp_installer (remote-exec): gzip: stdin: not in gzip format
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Child returned status 1
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Error is not recoverable: exiting now
module.prepare_openshift.null_resource.ocp_installer (remote-exec): cp: cannot stat ‘oc’: No such file or directory
module.prepare_openshift.null_resource.ocp_installer: Creation complete after 2s [id=894906645089249242]
module.prepare_openshift.null_resource.ocp_pullsecret: Creating...
module.prepare_openshift.null_resource.ocp_pullsecret: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_pullsecret: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_pullsecret (remote-exec): (output suppressed due to sensitive value in config)
module.prepare_openshift.null_resource.ocp_pullsecret (remote-exec): (output suppressed due to sensitive value in config)
module.prepare_openshift.null_resource.ocp_pullsecret: Creation complete after 3s [id=2213855489334895954]
module.prepare_openshift.data.template_file.installer_config: Reading...
module.prepare_openshift.data.template_file.installer_config: Read complete after 0s [id=a974cd42f9ea73442c095c53a67da5fd85a58726cde2a4f1aaa6a6d05324e2d6]
module.prepare_openshift.null_resource.ocp_install_config: Creating...
module.prepare_openshift.null_resource.ocp_install_config: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_install_config: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_install_config (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_install_config (remote-exec): Connected!
module.prepare_openshift.null_resource.ocp_install_config: Creation complete after 2s [id=2694546928241472463]
module.prepare_openshift.null_resource.ocp_install_manifests: Creating...
module.prepare_openshift.null_resource.ocp_install_manifests: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): Connected!
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): /tmp/terraform_1825922558.sh: line 6: /tmp/artifacts/openshift-install: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): sed: can't read /tmp/artifacts/install/manifests/cluster-scheduler-02-config.yml: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): /tmp/terraform_1825922558.sh: line 8: /tmp/artifacts/openshift-install: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): cp: cannot stat ‘/tmp/artifacts/install/*.ign’: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): 506 Cannot talk to daemon
module.prepare_openshift.null_resource.ocp_install_manifests: Still creating... [10s elapsed]
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): 200 OK
module.prepare_openshift.null_resource.ocp_install_manifests: Creation complete after 12s [id=5925476240016676298]
null_resource.get_kubeconfig: Creating...
null_resource.get_kubeconfig: Provisioning with 'local-exec'...
null_resource.get_kubeconfig (local-exec): Executing: ["/bin/sh" "-c" "mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /Users/marques/.ssh/id_rsa_mos-v8nqj [email protected]:/tmp/artifacts/install/auth/* ./auth/"]
module.openshift_controlplane.metal_device.node[2]: Creating...
module.openshift_workers.metal_device.node[0]: Creating...
module.openshift_workers.metal_device.node[1]: Creating...
module.openshift_controlplane.metal_device.node[1]: Creating...
module.openshift_controlplane.metal_device.node[0]: Creating...
module.openshift_bootstrap.metal_device.node[0]: Creating...
null_resource.get_kubeconfig (local-exec): Warning: Permanently added '139.178.84.39' (ED25519) to the list of known hosts.
null_resource.get_kubeconfig (local-exec): scp: /tmp/artifacts/install/auth/*: No such file or directory
module.openshift_controlplane.metal_device.node[2]: Still creating... [10s elapsed]

Once again, the control plane nodes do not seem to be accessible.
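
The `gzip: stdin: not in gzip format` lines in the log above suggest that whatever was piped into `tar` was not a gzip archive. One plausible cause (an assumption, since the log does not show the download step) is that the download returned an error page instead of the tarball. A small Python sketch of a pre-extraction check based on the two-byte gzip magic number `1f 8b`; the file names are made up for the demonstration:

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path: Path) -> bool:
    """True if the file starts with the gzip magic number."""
    with path.open("rb") as fh:
        return fh.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files: a real gzip payload and an HTML error page
    # saved under an archive-like name.
    good = Path(tmp) / "good.tar.gz"
    bad = Path(tmp) / "bad.tar.gz"
    good.write_bytes(gzip.compress(b"payload"))
    bad.write_bytes(b"<html>404 Not Found</html>")

    results = (looks_like_gzip(good), looks_like_gzip(bad))

print(results)
```

Running a check like this before extraction would fail the step early instead of letting the later `openshift-install: No such file or directory` errors pile up.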

@orenc1

orenc1 commented Jan 11, 2022

When I ran terraform with:

ocp_version=4.9
ocp_version_zstream=0

or:

ocp_version=4.8
ocp_version_zstream=14

which correspond to RHCOS 4.9.0 and 4.8.14 respectively, the /usr/share/nginx/html folder on the bastion/lb host was populated with these files:

[root@lb-0 ~]# ll /usr/share/nginx/html/
total 1000852
-rwxr-xr-x. 1 root root      3971 Oct  7  2019 404.html
-rwxr-xr-x. 1 root root      4020 Oct  7  2019 50x.html
-rwxr-xr-x. 1 root root      1175 Jan 11 11:41 bootstrap-append.ign
-rwxr-xr-x. 1 root root       606 Jan 11 11:41 bootstrap.ipxe
-rwxr-xr-x. 1 root root      1172 Jan 11 11:41 master-append.ign
-rwxr-xr-x. 1 root root       603 Jan 11 11:41 master.ipxe
-rwxr-xr-x. 1 root root  89362572 Jan 11 11:41 rhcos-4.8.14-x86_64-live-initramfs.x86_64.img
-rwxr-xr-x. 1 root root  10030448 Jan 11 11:41 rhcos-4.8.14-x86_64-live-kernel-x86_64
-rwxr-xr-x. 1 root root 925434368 Jan 11 11:41 rhcos-4.8.14-x86_64-live-rootfs.x86_64.img
-rwxr-xr-x. 1 root root      1172 Jan 11 11:41 worker-append.ign
-rwxr-xr-x. 1 root root       603 Jan 11 11:41 worker.ipxe

as expected, and the files are indeed accessible, e.g. http://147.28.129.183:8080/bootstrap.ipxe

I've also seen the following error on my local host when copying from /tmp/artifacts/install/auth/*. What should populate that folder?

Error: Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /home/ocohen/.ssh/id_rsa_metal-4cfi5 [email protected]:/tmp/artifacts/install/auth/* ./auth/': exit status 1. Output: Warning: Permanently added '147.28.129.183' (ECDSA) to the list of known hosts.
scp: /tmp/artifacts/install/auth/*: No such file or directory
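
To cross-check a directory listing against what an iPXE script will request, here is a Python sketch that rebuilds the RHCOS file names from the `release`, `zstream` and `arch` variables, following the pattern in the bootstrap.ipxe posted earlier. `rhcos_artifacts` is a hypothetical helper, and the `served` set is copied by hand from the listing above.

```python
def rhcos_artifacts(release: str, zstream: str, arch: str = "x86_64") -> list[str]:
    """File names the iPXE script requests for a given RHCOS version."""
    prefix = f"rhcos-{release}.{zstream}-{arch}-live"
    return [
        f"{prefix}-kernel-{arch}",
        f"{prefix}-initramfs.{arch}.img",
        f"{prefix}-rootfs.{arch}.img",
    ]


# RHCOS files copied from the /usr/share/nginx/html listing above.
served = {
    "rhcos-4.8.14-x86_64-live-initramfs.x86_64.img",
    "rhcos-4.8.14-x86_64-live-kernel-x86_64",
    "rhcos-4.8.14-x86_64-live-rootfs.x86_64.img",
}

missing_4_8_14 = [n for n in rhcos_artifacts("4.8", "14") if n not in served]
missing_4_9_0 = [n for n in rhcos_artifacts("4.9", "0") if n not in served]

print("missing for 4.8.14:", missing_4_8_14)
print("missing for 4.9.0: ", missing_4_9_0)
```

Against this particular listing nothing is missing for 4.8.14, while all three 4.9.0 files are absent; presumably the listing was captured during the 4.8.14 run.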

@saschagrunert
Contributor Author

saschagrunert commented Jan 12, 2022

@displague I got 404s when running terraform apply repeatedly without a git clean -fdx in between. I think the image is not downloaded correctly when Terraform runs multiple times. From a clean state it always downloads the image, and I never encountered 404 errors. 🤷

@displague
Member

It looks like #20 will be addressing some of these concerns.

@displague
Member

displague commented Jun 13, 2024

A number of the problems discussed here were previously resolved in #20.

Error: Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /home/ocohen/.ssh/id_rsa_metal-4cfi5 [email protected]:/tmp/artifacts/install/auth/* ./auth/': exit status 1. Output: Warning: Permanently added '147.28.129.183' (ECDSA) to the list of known hosts.
scp: /tmp/artifacts/install/auth/*: No such file or directory

This was experienced and fixed in #31
