QPC: All platforms are unstable #2121

Open
Willsparker opened this issue Apr 13, 2021 · 23 comments

@Willsparker
Contributor

Ref: https://ci.adoptopenjdk.net/job/QEMUPlaybookCheck/229/

With the latest QPC run, all platforms have failed to some degree. Four failed because buildJDK.sh was called with incorrect arguments (incorrect as of #1962).

The risc-v platform is still blocked by #1483

And the arm32 platform seems to be running out of space during the playbook execution.

@Willsparker
Contributor Author

qemuPlaybookCheck.sh has been changed to use the correct arguments (willsparker/2123_1). Testing on QPC at: https://ci.adoptopenjdk.net/job/QEMUPlaybookCheck/230/

@Willsparker
Contributor Author

The above PR should fix the buildJDK.sh issues. Once it has been merged, I'll look at running the arm32 box to determine whether there are still issues with disk space.

@Willsparker
Contributor Author

The PR to fix the arguments has been merged, and it appears to be working on all platforms. For the s390x and ppc64le architectures, the linux.sh script fails for JDK8 because it is unable to find JDK7 on the machines (Zulu-7 is not installed). I thought they were meant to fall back to using JDK8 in that case, but apparently not. This is not the case for JDK11, where I have had the build start on both machines. So far, though, I've only managed a QPC run in which ppc64le failed (maybe due to the wrong version of gcc?) and s390x core dumped (maybe due to a bad jdk-10 install?).

I'm going to look into the platform-specific-configuration script to see if it's meant to set the boot JDK to the build JDK in cases where the boot JDK can't be found. If it isn't, I'll look into installing Zulu-7 on those platforms.

@sxa
Member

sxa commented Apr 27, 2021

It can always be overridden with JDK7_BOOT_DIR if needed, but I know we've tried to let the autodetection work.
My preference is for it not to fall back if possible, since you then run the risk of being unsure whether, from one build to the next, it used a JDK7 or JDK8 as the boot JDK. Our "real" CentOS/RHEL7 ppc64le and s390x build machines all use a true JDK7.

@Willsparker
Contributor Author

Willsparker commented Apr 29, 2021

Okay! Thanks for the guidance. Looks like I'll have to adapt the Zulu-7 ansible task to include those platforms :-)

EDIT: Turns out no, this can't be done (see: https://adoptium.slack.com/archives/C53GHCXL4/p1619777906294700). Best case is having those platforms use JDK-8 to build

@Willsparker
Contributor Author

With #2176 (comment), I think the last part of fixing the non-RISC-V platforms is to figure out how to extend the build images, as they run out of space halfway through a build.

@Willsparker
Contributor Author

Willsparker commented May 13, 2021

QEMU qcow2 images can be resized easily enough with
qemu-img resize $IMG +10G (if you wanted to add 10GB to it).
However, the partitions still need to be extended, which will be specific to the platform, and is likely to be a massive pain. Helpfully, I think all of the platforms that fail due to disk issues will fail in the build. Therefore, we could mount it as a new partition and just build on that one if it's that difficult.

EDIT: Ah, hang on, I've already documented all of this. Okay, I'm just going to extend them by 10GB and create a new partition on /home/linux that will hold the extra 10GB. This method should be fine for all images that don't have swap partitions. I'll create backups of the current images, just in case :-)
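
A minimal sketch of the backup-then-resize step described above, assuming the image sits in the current directory. The `run` and `grow_image` helpers and the `./backups` default are hypothetical names for illustration; only `qemu-img resize $IMG +10G` comes from this thread. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only: helper names and the backup location are assumptions.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# grow_image IMAGE [EXTRA] [BACKUP_DIR]
# Copies IMAGE into BACKUP_DIR, then grows IMAGE by EXTRA (default 10G).
grow_image() {
    img=$1
    extra=${2:-10G}
    backup_dir=${3:-./backups}
    run mkdir -p "$backup_dir" &&
        run cp "$img" "$backup_dir/" &&
        run qemu-img resize "$img" "+$extra"
}
```

Running `DRY_RUN=1 grow_image Debian10.aarch64.dsk` prints the three commands without touching the image. This only grows the virtual disk; the partitions and file systems inside the guest still have to be extended separately.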

@Willsparker
Contributor Author

List of what I did to extend each of the images. After extending the partitions, I'll re-compress each image with xz -z and move it into /home/jenkins/qemu_base_images/resized_images/, so that we still have backups of the old images.

  • Debian10.aarch64.dsk :

  • Image is qcow2, so extend by running qemu-img resize $IMAGE +10G

When logging into the machine, it automatically extended the root partition, which was lovely to see. So that has 25GB on it now.

  • Debian11.riscv.dsk:

  • qcow2 image, so same command as above.

  • This one has 10GB mounted on the root partition, and 5GB on /home/linux. So, I'll delete the /home/linux partition and create a new one in the same place, of size 15GB:

root@debian:~# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda    254:0    0  25G  0 disk 
|-vda1 254:1    0  10G  0 part /
`-vda2 254:2    0   5G  0 part /home/linux
root@debian:~# fdisk /dev/vda
Command (m for help): d
Partition number (1,2, default 2): 2

Partition 2 has been deleted.

Command (m for help): n
Partition number (2-128, default 2): 2
First sector (20971393-52428766, default 20971520): 
Last sector, +/-sectors or +/-size{K,M,G,T,P} (20971520-52428766, default 52428766): 

Created a new partition 2 of type 'Linux filesystem' and of size 15 GiB.
Partition #2 contains a ext4 signature.

Do you want to remove the signature? [Y]es/[N]o: Y

The signature will be removed by a write command.
Command (m for help): w
The partition table has been altered.
Syncing disks.

root@debian:~# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda    254:0    0  25G  0 disk 
|-vda1 254:1    0  10G  0 part /
`-vda2 254:2    0  15G  0 part /home/linux
  • Debian8.Arm32.dsk:

  • Still a qcow2 image, so, as above

This one's a bit different: there are a lot more partitions:

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    254:0    0   30G  0 disk 
├─vda1 254:1    0  243M  0 part /boot
├─vda2 254:2    0  5.3G  0 part /
├─vda3 254:3    0    1K  0 part 
├─vda5 254:5    0  502M  0 part [SWAP]
└─vda6 254:6    0   14G  0 part /home

Unfortunately, I can't seem to find a way of extending the root partition, without the following error / issue on startup

[  155.371104] FS-Cache: Netfs 'nfs' registered for caching
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" to try again
to boot into default mode.
[  155.698568] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
Give root password for maintenance
(or type Control-D to continue): 

I tried removing partitions 2-6, and recreating the / and /home partition, but the same issue occurs. I tried just removing the second and third partition, and it caused the rest of them to disappear as well. I'm going to skip this one for now, in the hopes that somebody else knows how to do this ( @sxa ...? 👀 )

@Willsparker
Contributor Author

Willsparker commented May 18, 2021

  • Ubuntu1804.arm64.dsk:

The disk format is raw. I was able to extend it with qemu-img resize <> +10G though, and it is showing up in lsblk:

NAME                             MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                              252:0    0   30G  0 disk 
├─vda1                           252:1    0  512M  0 part /boot/efi
└─vda2                           252:2    0 19.5G  0 part 
  ├─ubuntu--18--arm64--vg-root   253:0    0 18.6G  0 lvm  /
  └─ubuntu--18--arm64--vg-swap_1 253:1    0  976M  0 lvm  [SWAP]

However, whenever I run any fdisk commands, I get
GPT PMBR size mismatch (41943039 != 62914559) will be corrected by w(rite)., and the write then fails on exit with fdisk: failed to write disklabel: Invalid argument. For some reason, fdisk can't fix this, but parted can: running parted -l prompts a fix/ignore check. After this, I followed the instructions from this link:

  • created a new partition (/dev/vda3) from the remaining space with fdisk
  • created the physical volume: pvcreate /dev/vda3
  • extended the volume group: vgextend /dev/ubuntu-18-arm64-vg /dev/vda3
  • extended the LV that needed it: lvextend +2568 /dev/ubuntu-18-arm64-vg/root
  • resized the file system to match: resize2fs /dev/ubuntu-18-arm64-vg/root

The result is:

root@ubuntu-18-arm64:~# lsblk
NAME                             MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                              252:0    0   30G  0 disk 
├─vda1                           252:1    0  512M  0 part /boot/efi
├─vda2                           252:2    0 19.5G  0 part 
│ ├─ubuntu--18--arm64--vg-root   253:0    0 28.6G  0 lvm  /
│ └─ubuntu--18--arm64--vg-swap_1 253:1    0  976M  0 lvm  [SWAP]
└─vda3                           252:3    0   10G  0 part 
  └─ubuntu--18--arm64--vg-root   253:0    0 28.6G  0 lvm  /

Which looks alright to me 👍
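
For reference, the LVM part of the process for Ubuntu1804.arm64.dsk can be sketched roughly as below. This assumes the new partition already exists and that parted has fixed the GPT mismatch. The `run` and `extend_root_lv` helpers are hypothetical, and `lvextend -l +100%FREE` is a swapped-in choice that hands all free extents to the LV, which is not the exact lvextend invocation used in the comment above. `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: helper names are assumptions, and "+100%FREE" replaces the
# explicit size used in the original comment.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# extend_root_lv PARTITION VOLUME_GROUP [LOGICAL_VOLUME]
# Adds PARTITION to VOLUME_GROUP, grows the LV over all free extents,
# then grows the ext file system to fill the LV.
extend_root_lv() {
    part=$1
    vg=$2
    lv=${3:-root}
    run pvcreate "$part" &&
        run vgextend "$vg" "$part" &&
        run lvextend -l +100%FREE "/dev/$vg/$lv" &&
        run resize2fs "/dev/$vg/$lv"
}
```

For example, `DRY_RUN=1 extend_root_lv /dev/vda3 ubuntu-18-arm64-vg` prints the four commands so they can be reviewed before being run as root inside the guest.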

  • Ubuntu18.ppc64le.dsk:

Also uses LVM, and a raw disk image, so I'll do the same process as before, hopefully. Initial lsblk shows (after extending the disk):

NAME                 MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                    8:0    0   25G  0 disk 
├─sda1                 8:1    0    7M  0 part 
└─sda2                 8:2    0   15G  0 part 
  ├─linux--vg-root   253:0    0 14.3G  0 lvm  /
  └─linux--vg-swap_1 253:1    0  676M  0 lvm  [SWAP]
sr0                   11:0    1 1024M  0 rom 

I wonder if sr0 there is going to cause issues ..
Nope, didn't seem to. Having followed the same instructions as before (and instructions on how to extend the LVM here):

root@linux:~# lsblk
NAME                 MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                    8:0    0   25G  0 disk 
├─sda1                 8:1    0    7M  0 part 
├─sda2                 8:2    0   15G  0 part 
│ ├─linux--vg-root   253:0    0 24.3G  0 lvm  /
│ └─linux--vg-swap_1 253:1    0  676M  0 lvm  [SWAP]
└─sda3                 8:3    0   10G  0 part 
  └─linux--vg-root   253:0    0 24.3G  0 lvm  /
sr0                   11:0    1 1024M  0 rom 
  • Ubuntu18.S390x.dsk :

This one is a qcow2 image, so the same resize command as the others. This one also uses LVM but, for some reason, didn't require parted -l.

root@ubuntu:~# lsblk
NAME                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                   252:0    0   25G  0 disk 
└─vda1                252:1    0   15G  0 part 
  ├─ubuntu--vg-root   253:0    0   14G  0 lvm  /
  └─ubuntu--vg-swap_1 253:1    0  964M  0 lvm  [SWAP]

After the same process as above:

NAME                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                   252:0    0   25G  0 disk 
├─vda1                252:1    0   15G  0 part 
│ ├─ubuntu--vg-root   253:0    0 24.1G  0 lvm  /
│ └─ubuntu--vg-swap_1 253:1    0  964M  0 lvm  [SWAP]
└─vda2                252:2    0   10G  0 part 
  └─ubuntu--vg-root   253:0    0 24.1G  0 lvm  /

I'm glad I/whoever set up the Ubuntu images had the foresight to make the partitions LVM :-)

@karianna karianna added this to the May 2021 milestone May 18, 2021
@Willsparker
Contributor Author

Willsparker commented May 19, 2021

With all but the Debian.ARM32 image extended, I'll swap the original images with the resized images to test on the vagrant server. I'm aware that now QPC can run on build-equinix-ubuntu2004-armv8-1, so I'll quickly reconfigure the job to just use the infra-ibmcloud-vagrant* machines for the time being. If all goes well, I'll move the images over :-)

See:
QPC#264

EDIT: Looks like the Debian ones failed. It looks like debian10/aarch64 didn't boot in time, and something else is wrong with debian11/riscv64. Not too worried about RISCV, as it is currently running fine in QPC#263

@Willsparker
Contributor Author

Willsparker commented May 20, 2021

Nope, I messed up the RISCV one. This has happened 3 times now:

08:46:58 /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

08:46:59 /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys

08:47:00 mkdir: cannot create directory '.ssh': Permission denied

I would assume in the process of extending the partition, the Linux user no longer owns their own home directory. Redid that, and tested here: QPC#267
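
A small sketch of how the ownership assumption above could be checked inside the guest. The helper names are hypothetical, and the `linux` user name is an assumption based on the /home/linux mount point.

```shell
#!/bin/sh
# Sketch only: helper names and the "linux" user are assumptions.

# owner_uid_of PATH: print the numeric UID that owns PATH.
owner_uid_of() {
    ls -ldn "$1" | awk '{print $3}'
}

# owned_by PATH UID: succeed if PATH is owned by UID.
owned_by() {
    [ "$(owner_uid_of "$1")" = "$2" ]
}

# Example (as root inside the guest):
#   owned_by /home/linux "$(id -u linux)" || chown -R linux:linux /home/linux
```

If the partition was recreated and the file system rebuilt, the mount point ends up owned by root, which would explain the `mkdir: cannot create directory '.ssh': Permission denied` message.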

It still looks like the Debian10/aarch64 one isn't booting in time. I'll try a local test where I extend the boot time to 180s, to see if this fixes it.
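
A rough sketch of a longer boot wait, assuming the check is "can we SSH in yet". The `wait_for` helper is hypothetical and is not taken from qemuPlaybookCheck.sh. Port 10020 is the one shown in a later log for debian10/aarch64, and the `linux` user is an assumption.

```shell
#!/bin/sh
# Sketch only: wait_for is a hypothetical helper.

# wait_for TIMEOUT_S INTERVAL_S COMMAND [ARGS...]
# Retries COMMAND every INTERVAL_S seconds until it succeeds or roughly
# TIMEOUT_S seconds of sleeping have passed. Returns 0 on success, 1 on timeout.
wait_for() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}

# Example: wait up to 180s for the guest to accept SSH connections.
#   wait_for 180 5 ssh -p 10020 -o BatchMode=yes -o ConnectTimeout=5 \
#       linux@localhost true || echo "guest did not boot in time"
```

The elapsed time only counts the sleeps, so the real wait is somewhat longer when each SSH attempt itself takes a few seconds.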

Bright side, both Ubuntu18/ppc64le and Ubuntu18/aarch64 seem to be fine after the image resize, in QPC#264- ppc64le actually passed, and aarch64 failed the test, though it was able to fully complete a build, which is nice.

Ubuntu18/s390x appears to be failing on apt-get upgrade :

Processing triggers for initramfs-tools (0.130ubuntu3.12) ...
update-initramfs: Generating /boot/initrd.img-4.15.0-74-generic
Using config file '/etc/zipl.conf'
Run /lib/s390-tools/zipl_helper.device-mapper /boot
Error: Unsupported setup: Directory '/boot' is located on a multi-target device-mapper device
Error: Script could not determine target parameters
run-parts: /etc/initramfs/post-update.d//zz-zipl exited with return code 1
dpkg: error processing package initramfs-tools (--configure):
 installed initramfs-tools package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 linux-firmware
 initramfs-tools
E: Sub-process /usr/bin/dpkg returned an error code (1)

Presumably this is an issue with the root file system now being spread across several partitions. Interesting that this wasn't an issue with the other Ubuntu machines. I could apt hold the linux-firmware and initramfs-tools packages to 'fix' the issue, but that feels wrong ...

@Haroon-Khel
Contributor

Latest failures/instabilities

ppc64le ubuntu18 - fails in the test stage. Looks like the test script needs to be updated to support the new name of the tests repo

00:17:27 TESTDIR: /home/linux/testLocation/openjdk-tests is invalid. Please use --testdir|-t to set valid TESTDIR under aqa-tests. Default value current dir (pwd) is used if not provided.
00:17:32 /home/linux/openjdk-infrastructure/ansible/pbTestScripts/testJDK.sh: line 15: cd: /home/linux/testLocation/openjdk-tests/TKG: No such file or directory
00:17:32 + grep -q 'FAILED: 0' /home/vagrant1/workspace/QEMUPlaybookCheck/ARCHITECTURE/ppc64le/OS/ubuntu18/label/vagrant/ansible/pbTestScripts/qemu_pbCheck/logFiles/UBUNTU18.PPC64LE.test_log
00:17:32 + echo TEST FAILED

s390x ubuntu18 - fails when ansible tries running apt-get upgrade, https://ci.adoptopenjdk.net/job/QEMUPlaybookCheck/272/ARCHITECTURE=s390x,OS=ubuntu18,label=vagrant/consoleFull

12:07:41 TASK [Common : Run apt-get upgrade] ********************************************
12:43:05 fatal: [localhost]: FAILED! => {"changed": false, "msg": "'/usr/bin/apt-get upgrade --with-new-pkgs ' failed: E: Sub-process /usr/bin/dpkg returned an error code (1)\n", "rc": 100, "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nCalculating upgrade...\nThe following NEW packages will be installed:\n  distro-info gcc-11-base libgcc-s1 libnetplan0 linux-headers-4.15.0-144\n  linux-headers-4.15.0-144-generic linux-image-4.15.0-144-generic\n  linux-modules-4.15.0-144-generic linux-modules-extra-4.15.0-144-generic\nThe following packages will be upgraded:\n  accountsservice apt apt-utils base-files bind9 bind9-doc bind9-host\n  bind9utils bsdutils busybox-initramfs busybox-static ca-certificates dbus\n  distro-info-data dmeventd dmsetup dnsutils e2fsprogs fdisk file\n  friendly-recovery gcc-8-base initramfs-tools initramfs-tools-...

aarch64 ubuntu18 - disconnected during build stage, playbook passes without error

01:01:35 Connection to localhost closed by remote host.

aarch64 debian10 - Connection reset by peer

12:01:00 TASK [Gathering Facts] *********************************************************
12:01:00 fatal: [localhost]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: read: Connection reset by peer\r\nConnection reset by 127.0.0.1 port 10020", "unreachable": true}

arm32 debian8 - Fails to download GCC 7.5 binary. URL is invalid

14:15:17 TASK [gcc_7 : Download AdoptOpenJDK gcc-7.5.0 binary] **************************
14:15:35 fatal: [localhost]: FAILED! => {"changed": false, "dest": "/tmp/ansible-adoptopenjdk-gcc-7.tar.xz", "elapsed": 6, "msg": "Request failed", "response": "HTTP Error 404: Not Found", "status_code": 404, "url": "https://ci.adoptopenjdk.net/userContent/gcc/gcc750+ccache.armv7l.tar.xz"}
14:15:35 

riscv seems to be successful in its playbook run

@sxa
Member

sxa commented Jun 22, 2021

Looks like the test script needs to be updated to support the new name of the tests repo

Hmmm all requests to the old repo should redirect so there may be another underlying issue there ...

@Haroon-Khel
Contributor

Haroon-Khel commented Jun 22, 2021

Hmmm all requests to the old repo should redirect so there may be another underlying issue there ...

I hit this error on my own time while running some tests using the tests repo. It popped up when using the ./get.sh script. I noticed that I only began to hit it when the repo name changed from openjdk-tests to aqa-tests. I was using a local copy of the tests repo that I had cloned before the name change, and renaming the folder from openjdk-tests to aqa-tests solved it. I've put in PR #2230, and I'll be testing it to see if it solves this.
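
A minimal sketch of the folder rename described above, written as a helper a test script could call. `resolve_test_repo` is a hypothetical name and is not necessarily what PR #2230 does. The /home/linux/testLocation path comes from the log earlier in this thread.

```shell
#!/bin/sh
# Sketch only: resolve_test_repo is a hypothetical helper.

# resolve_test_repo BASE_DIR
# Prints the path of the tests checkout under BASE_DIR, renaming a stale
# openjdk-tests directory to aqa-tests if needed. Fails if neither exists.
resolve_test_repo() {
    base=$1
    if [ -d "$base/aqa-tests" ]; then
        printf '%s\n' "$base/aqa-tests"
    elif [ -d "$base/openjdk-tests" ]; then
        mv "$base/openjdk-tests" "$base/aqa-tests" &&
            printf '%s\n' "$base/aqa-tests"
    else
        return 1
    fi
}

# Example:
#   cd "$(resolve_test_repo /home/linux/testLocation)/TKG"
```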

@sxa
Member

sxa commented Jun 22, 2021

arm32 debian8 - Fails to download GCC 7.5 binary. URL is invalid

That looks from the log as though the job was tested using a fork of the repository that doesn't have this change in it: https://github.com/adoptium/infrastructure/pull/2201/files

@Haroon-Khel
Contributor

That looks from the log as though the job was tested using a fork of the repository that doesn't have this change in it: https://github.com/adoptium/infrastructure/pull/2201/files

Possibly. I was testing this PR, #2203, at the time (in hindsight I should've just tested the playbook run, since the changes only affect mac). I'll run a new job on master.

@sxa sxa added this to the 2023-02 (February) milestone Jan 27, 2023
@sxa
Member

sxa commented Jan 30, 2023

As of 19/12/22 the failing builds are

aarch64 deb10

23:03:53 TASK [Common : Allow https apt sources] ****************************************
23:04:36 [WARNING]: Updating cache and auto-installing missing dependency: python-apt
23:04:36 fatal: [localhost]: FAILED! => {"changed": false, "cmd": "apt-get update", "msg": "E: Repository http://deb.debian.org/debian buster InRelease' changed its 'Suite' value from 'stable' to 'oldstable'

That one will be from Debian 10 going out of support, so we should have something in the playbooks to update the apt repo reference. See https://wiki.debian.org/DebianOldStable for background.
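
One possible shape for this, as a sketch rather than a playbook change: detect the 'Suite' error in the apt output and retry with `--allow-releaseinfo-change`, which apt-get accepts from apt 1.8.2 onwards. The `suite_changed` helper is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: suite_changed is a hypothetical helper.

# suite_changed LOGFILE: succeed if the apt log contains the
# "changed its 'Suite' value" error quoted above.
suite_changed() {
    grep -q "changed its 'Suite' value" "$1"
}

# Example (as root inside the guest):
#   log=$(mktemp)
#   if ! apt-get update >"$log" 2>&1 && suite_changed "$log"; then
#       apt-get update --allow-releaseinfo-change
#   fi
```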

s390x ubuntu18

23:06:25 TASK [Common : Run apt-get upgrade] ********************************************
...
 "Error: Unsupported setup: Directory '/boot' is located on a multi-target device-mapper device",

@Willsparker Have you hit this one before?

riscv deb11

23:08:13 TASK [Common : Allow https apt sources] ****************************************
23:08:34 fatal: [localhost]: FAILED! => {"cache_update_time": 1598344993, "cache_updated": false, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"      install 'apt-transport-https'' failed: E: Failed to fetch http://deb.debian.org/debian-ports/pool/main/a/apt/apt-transport-https_2.1.10_all.deb  404  Not Found [IP: 199.232.10.132 80]

That's slightly odd and suggests that apt-get update has not been able to complete successfully since the directory mentioned there does have version 2.5.5 of the package in place.

arm32 deb8 Failed to install the following package for what looked to be a dependency reason:

failed: [localhost] (item=flex)
failed: [localhost] (item=g++)

May just be because Debian 8 is ancient (although it might be interesting to see if we can point directly at the correct repo from archive.debian.org). However, we should probably check what the latest Raspbian image is and go with that (and possibly add a recent Ubuntu, since that has become more common on the platform). EDIT: Current is based on Debian 11 (Bullseye) with kernel 5.15; the "legacy" image is Debian 10 (Buster) with kernel 5.10. Reference
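
A sketch of what pointing at archive.debian.org could look like, assuming the image is plain Debian 8 (codename jessie) and that a single `main` entry is enough. The `write_archive_sources` helper is hypothetical, and whether flex and g++ then resolve on arm32 is untested.

```shell
#!/bin/sh
# Sketch only: write_archive_sources is a hypothetical helper.

# write_archive_sources CODENAME DEST
# Writes a one-line apt sources file for CODENAME on archive.debian.org.
write_archive_sources() {
    codename=$1
    dest=$2
    printf 'deb http://archive.debian.org/debian %s main\n' "$codename" >"$dest"
}

# Example (as root inside the guest). Archived releases have expired
# Release files, so the validity check is switched off for the update:
#   write_archive_sources jessie /etc/apt/sources.list
#   apt-get -o Acquire::Check-Valid-Until=false update
```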
