Multiple tests from jdk_jfr_2 fail on arm32 due to dump related issues #3115

Open
Haroon-Khel opened this issue Nov 2, 2021 · 9 comments

@Haroon-Khel
Contributor

Many tests in jdk_jfr_2 fail in the extended openjdk suite on arm32.

The list of failed tests is at https://ci.adoptopenjdk.net/job/Test_openjdk8_hs_extended.openjdk_arm_linux_testList_2/16/testReport/
There are 335 failed tests in total, so I won't post all of them here.

All of the failures have a similar error log:

[GC (Allocation Failure) [DefNew: 4416K->512K(4928K), 0.0075890 secs] 4416K->1241K(15872K), 0.0077235 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
[GC (Allocation Failure) [DefNew: 4928K->463K(4928K), 0.0157184 secs] 5657K->1697K(15872K), 0.0157842 secs] [Times: user=0.00 sys=0.00, real=0.02 secs] 
[GC (Allocation Failure) [DefNew: 4879K->344K(4928K), 0.0102744 secs] 6113K->2030K(15872K), 0.0105317 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
[Full GC (System.gc()) [Tenured: 1685K->1798K(10944K), 0.0291196 secs] 6292K->1798K(15872K), [Metaspace: 4583K->4583K(5424K)], 0.0292266 secs] [Times: user=0.02 sys=0.00, real=0.03 secs] 
[thread -759499664 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0xf5f48698, pid=423216, tid=0xd3038470
#
# JRE version: OpenJDK Runtime Environment (8.0_312-b07) (build 1.8.0_312-b07)
# Java VM: OpenJDK Client VM (25.312-b07 mixed mode linux-aarch32 )
# Problematic frame:
# V  [libjvm.so+0x333698]  write_checkpoint_header(unsigned char*, long long, long long, bool, unsigned int)+0xe8
#
# Core dump written. Default location: /home/jenkins/workspace/Test_openjdk8_hs_extended.openjdk_arm_linux_testList_2/aqa-tests/TKG/output_16358029827425/jdk_jfr_2/work/scratch/15/core or core.423216
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/Test_openjdk8_hs_extended.openjdk_arm_linux_testList_2/aqa-tests/TKG/output_16358029827425/jdk_jfr_2/work/scratch/15/hs_err_pid423216.log
@roberttoyonaga
Contributor

roberttoyonaga commented Jul 29, 2024

I've been investigating this problem a little. Here's what I've found:

Root cause

Compiler optimizations in hotspot/src/cpu/aarch32/vm/bytes_aarch32.hpp break things when building for aarch32 in aarch32 containers running on an aarch64 host. The result is a SIGBUS address alignment error (as shown in the issue description above).

Solution

Upgrade the code in bytes_aarch32.hpp to the newer implementation in JDK11/JDK17 (recommended). Or turn off optimizations for that file (not recommended).

Explanation

The problems only manifest when using aarch32 containers on an aarch64 host. They disappear when using an aarch32 host directly (i.e. test-sxa-armv7l-ubuntu2004-odroid-2). [references: 1, 2, 3]

After reproducing the jtreg failures in an aarch32 container on an aarch64 host, the hs_err file complains about a SIGBUS address alignment issue stemming from the following JFR functions:

  • write_checkpoint_header(unsigned char*, long long, long long, bool, unsigned int)
  • JfrPeriodicEventSet::requestDoubleFlag()

I was also able to reproduce the same problem using a simple "hello world" Java app when running with JFR. There was no problem when running without -XX:StartFlightRecording (JFR disabled). So the problem is specific to JFR, but not isolated to the test cases.

First, I wanted to figure out why JFR specifically was causing the SIGBUS. I suspected that it's because JFR does a lot of low-level memory writing when committing events. Indeed, it turns out that JFR uses bytes_aarch32.hpp when writing event data to its in-memory buffers. Code in bytes_aarch32.hpp deals with reading and writing directly to addresses, which would explain the SIGBUS address alignment errors. I suspect that the SIGBUS error doesn't point to code in this file directly because of optimizations such as inlining (write_checkpoint_header() and requestDoubleFlag() eventually call code in this file).
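
To illustrate why packed event data can end up at misaligned addresses, here is a minimal sketch. It is not JFR or HotSpot code: the helper names `is_aligned` and `next_field_is_aligned` are hypothetical, and the layout (a 1-byte field followed by a 64-bit field) is only an assumed example of a tightly packed buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `addr` is a multiple of `alignment`.
inline bool is_aligned(const void* addr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(addr) % alignment == 0;
}

// Assumed packed layout: a 1-byte field immediately followed by a 64-bit field.
// Returns whether the 64-bit field would start on an 8-byte boundary.
inline bool next_field_is_aligned(const unsigned char* buffer_start) {
    const unsigned char* after_flag = buffer_start + sizeof(std::uint8_t);
    return is_aligned(after_flag, sizeof(std::int64_t));
}
```

If the buffer itself starts on an 8-byte boundary, the 64-bit field lands one byte past it, so a store instruction that requires natural alignment could fault at that address.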

Out of curiosity, I created a slowdebug build (no optimizations) and found I could not reproduce any of the problems. This led me to believe that some optimizations were being done for aarch32 that did not work on the aarch64 host. I confirmed this theory by adding the directives #pragma GCC push_options and #pragma GCC optimize ("O0") to bytes_aarch32.hpp to disable optimizations, and then generating a fresh release build. The new release build with those directives also had no problems.
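
As a rough sketch of how such directives scope an optimization level to part of a file: GCC applies `#pragma GCC optimize` to functions defined after it, and `push_options` saves the current settings. The closing `#pragma GCC pop_options` and the function `store_u2_bytewise` are assumptions added for illustration; neither appears in the comment above, and the function is not taken from bytes_aarch32.hpp.

```cpp
#include <cstdint>

#pragma GCC push_options        // save the current optimization settings
#pragma GCC optimize ("O0")     // functions defined below are compiled without optimization

// Hypothetical stand-in for a byte-access helper: writes a 16-bit value
// one byte at a time, low byte first, so no aligned store is needed.
static void store_u2_bytewise(unsigned char* p, std::uint16_t x) {
    p[0] = static_cast<unsigned char>(x & 0xFF);
    p[1] = static_cast<unsigned char>(x >> 8);
}

#pragma GCC pop_options         // restore the saved settings for the rest of the file
```

Compilers other than GCC may ignore these pragmas, so this only demonstrates the scoping mechanism, not a portable fix.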

So at this point, I had determined that the SIGBUS alignment error was likely caused by optimizations in the bytes_aarch32.hpp code that deals with reading/writing bytes directly, and that JFR is uniquely affected because it does a lot of low-level reading/writing to event buffers whenever an event is committed.

Next, I investigated why this problem only manifests in JDK8, but not in JDK11 and JDK17 [references: 1, 2]. The reason is that after JDK8, bytes_aarch32.hpp was converted to bytes_arm.hpp. The newer bytes_arm.hpp is much more careful about address alignment, and thus does not trigger SIGBUS.

To confirm this, I swapped out the old bytes_aarch32.hpp for the new bytes_arm.hpp, and the problems disappeared.
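
For readers unfamiliar with alignment-safe access, one common general technique is to copy through `memcpy` instead of dereferencing a casted pointer. The sketch below is not the bytes_arm.hpp implementation; the function names `put_u8_unaligned` and `get_u8_unaligned` are hypothetical and only show the idea.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: stores a 64-bit value at any address, aligned or not.
// memcpy makes no alignment assumption about `p`, unlike *(uint64_t*)p = x.
inline void put_u8_unaligned(unsigned char* p, std::uint64_t x) {
    std::memcpy(p, &x, sizeof x);
}

// Hypothetical helper: loads a 64-bit value from any address, aligned or not.
inline std::uint64_t get_u8_unaligned(const unsigned char* p) {
    std::uint64_t x;
    std::memcpy(&x, p, sizeof x);
    return x;
}
```

Because the compiler is not told the address is 8-byte aligned, it cannot legally pick an instruction that depends on that alignment for these accesses.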

@sophia-guo
Contributor

Thanks @roberttoyonaga! Currently all arm32 test agents are Docker hosted, and I believe the hosts are aarch64. @Haroon-Khel, is that correct?

https://ci.adoptium.net/label/ci.role.test&&sw.os.linux&&hw.arch.aarch32/

If that's the case, I would suggest disabling those tests for jdk8 arm32. Thoughts, @ShelleyLambert?

@smlambert
Contributor

FYI, my active GitHub account is @smlambert (other one tied to an old work email).

Excluding the testcases because we do not have appropriate hardware to run them is fine. It should be done in the vendor problem list though (https://github.com/adoptium/aqa-tests/tree/master/openjdk/excludes/vendors/eclipse), so we do not affect others who may wish to run these tests and have appropriate hardware.

@Haroon-Khel
Contributor Author

> Thanks @roberttoyonaga! Currently all test agents of arm32 are the docker hosted and I believe it's aarch64 host @Haroon-Khel , is that correct ?

Yes, the arm32 machines on which our nightly tests run are Docker hosted, on arm64 hosts. The two actual arm32 odroid machines do not have the ci.role.test tag, so our nightly tests do not run on them.

@roberttoyonaga
Contributor

Do you think it would be better just to update hotspot/src/cpu/aarch32/vm/bytes_aarch32.hpp? This file only exists in the Adoptium repo, not in the OpenJDK repo.

@sophia-guo
Contributor

Upstream, the file is at https://hg.openjdk.org/aarch32-port/jdk8u/hotspot/file/5ee36e3a5a61/src/cpu/aarch32/vm/bytes_aarch32.hpp. Adoptium mirrors upstream. Would it be possible or easy to update upstream?

@roberttoyonaga
Contributor

Oh, I see. OK, in that case maybe it's better just to exclude the tests in the problem list files.

@roberttoyonaga
Contributor

I've made a PR to exclude the failing tests for the eclipse vendor only: #5469
Once that is merged, is it safe to close this GH issue?

@roberttoyonaga
Contributor

#5469 has been merged.
