Nightly build failures #1092

Closed
jewatkins opened this issue Nov 21, 2024 · 19 comments · Fixed by #1096
Labels
Testing Stuff related to testing Albany (including nightly tests)

Comments

@jewatkins
Collaborator

A couple of builds are broken this morning. It looks like it's due to recent PRs. @mperego, do you want to take a look?

jewatkins added the Testing Stuff related to testing Albany (including nightly tests) label Nov 21, 2024
@mperego
Collaborator

mperego commented Nov 21, 2024

A fix is coming up soon

mperego added a commit that referenced this issue Nov 21, 2024
@ikalash
Collaborator

ikalash commented Nov 26, 2024

I believe there are new errors from omegah + Kokkos: https://sems-cdash-son.sandia.gov/cdash//build/312099/configure. There were others on mockba and camobap, which I have already fixed, but those fixes won't show up in the nightlies until tomorrow.

@mperego
Collaborator

mperego commented Nov 26, 2024

Yeah, Kokkos 4.5.0 was snapshotted into Trilinos yesterday:
trilinos/Trilinos#13589

@ikalash
Collaborator

ikalash commented Nov 26, 2024

Is someone working on fixing it?

@mcarlson801
Collaborator

mcarlson801 commented Nov 26, 2024

Looks like omega_h requires Kokkos 3.7 here: https://github.com/SCOREC/omega_h/blob/94aa915568abb01c169fc5b84bbbd12acede8935/CMakeLists.txt#L60

I assume this would need to be updated to reflect the new version.

Edit: tagging @cwsmith in case he hasn't seen this thread

@jewatkins
Collaborator Author

I turned off omega_h on blake for now. There are a lot of failing tests on weaver...

@mperego
Collaborator

mperego commented Nov 27, 2024

Most of the landice tests are failing with NaN residuals. I'll try to have a look in the afternoon.

@jewatkins
Collaborator Author

At least most of the GPU tests are also failing on CPU. There are a few exceptions:

corePDEs_ContinuationHeat1D | Failed | 3s 140ms | Completed (Failed) | Unstable | Unstable
corePDEs_Ioss3D | Failed | 4s 270ms | Completed (Failed) | Unstable | Unstable
corePDEs_SideSetLaplacian_3D | Failed | 5s 520ms | Completed (Failed) | Broken | Unstable
corePDEs_SteadyHeat2D | Failed | 2s 530ms | Completed (Failed) | Unstable | Unstable
corePDEs_SteadyHeat2D_SERIAL | Failed | 4s 340ms | Completed (Failed) | Unstable | Unstable
corePDEs_SteadyHeat3D | Failed | 5s 840ms | Completed (Failed) | Unstable | Unstable
corePDEs_SteadyHeat3D_10x10x10_ascii | Failed | 6s 340ms | Completed (Failed) | Unstable | Unstable
corePDEs_SteadyHeat3D_nodeGIDArrayResponse | Failed | 4s 920ms | Completed (Failed) | Unstable | Unstable
corePDEs_SteadyHeatConstrainedOpt2D_Scalar_And_Dist_Param | Failed | 3s 230ms | Completed (Failed) | Unstable | Unstable
demoPDEs_ThermoElectrostatics2D | Failed | 2s 390ms | Completed (Failed) | Unstable | Unstable

@mperego
Collaborator

mperego commented Nov 27, 2024

Yeah, let's start fixing the tests on CPUs.

@cwsmith
Collaborator

cwsmith commented Nov 27, 2024

Thanks @mcarlson801.

My local build of omegah (SCOREC/omega_h@94aa915) with the kokkos (4.5.00) cuda backend was successful.

Can someone post, or email me, the omegah-relevant portions of the build log?

@jewatkins
Collaborator Author

> Thanks @mcarlson801.
>
> My local build of omegah (SCOREC/omega_h@94aa915) with the kokkos (4.5.00) cuda backend was successful.
>
> Can someone post, or email me, the omegah-relevant portions of the build log?

looks like it's a configure issue:

CMake Error at /home/projects/albany/nightlyCDashAlbanyBlake/build-intel/AlbBuildReleaseIntel/tpls/omega_h/omega_h-src/cmake/bob.cmake:346 (find_package):
  Could not find a package configuration file provided by "Kokkos" (requested
  version 3.7) with any of the following names:

    KokkosConfig.cmake
    kokkos-config.cmake

  Add the installation prefix of "Kokkos" to CMAKE_PREFIX_PATH or set
  "Kokkos_DIR" to a directory containing one of the above files.  If "Kokkos"
  provides a separate development package or SDK, be sure it has been
  installed.
Call Stack (most recent call first):
  /home/projects/albany/nightlyCDashAlbanyBlake/build-intel/AlbBuildReleaseIntel/tpls/omega_h/omega_h-src/CMakeLists.txt:63 (bob_add_dependency)

@mcarlson801
Collaborator

> Most of the landice tests are failing with NaN residuals. I'll try to have a look in the afternoon.

I'm looking at this on cpu as well so I'll let you know if I find anything before then.

@cwsmith
Collaborator

cwsmith commented Nov 27, 2024

A PR for fixing the Omega_h configure issue is here: #1095.

@mperego
Collaborator

mperego commented Nov 28, 2024

I've looked into it a bit, and it seems that fields are not loaded correctly and can contain garbage.
For example, when the test landIce_FO_Test_Dome_interpSurf loads the surface_height field (Residual evaluation) at line
https://github.com/sandialabs/Albany/blob/master/src/evaluators/state/PHAL_LoadStateField_Def.hpp#L62
stateData contains values like 6.92382e-310 alongside reasonable values.

That's all I was able to figure out for today.
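
A value such as 6.92382e-310 is a subnormal double, which physical fields like surface_height would not normally hold. One plausible reading, which is an assumption and not something confirmed in this thread, is that the entry was never written and still holds raw bits such as a pointer. The sketch below is plain C++ with hypothetical helper names (`bits_as_double`, `looks_like_raw_bits`) and assumes IEEE 754 64-bit doubles; it shows that a typical 64-bit Linux user-space address pattern reads back as roughly 6.9e-310, and it offers a cheap check for flagging such entries while debugging.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a raw 64-bit pattern as a double (memcpy avoids aliasing issues).
// Assumption: double is an IEEE 754 binary64 value.
inline double bits_as_double(std::uint64_t bits) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "this sketch expects a 64-bit double");
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

// Hypothetical debugging heuristic: a subnormal value (nonzero, magnitude
// below about 2.2e-308) is rarely a legitimate field entry and often means
// the memory was never initialized.
inline bool looks_like_raw_bits(double v) {
  return std::fpclassify(v) == FP_SUBNORMAL;
}
```

For instance, `bits_as_double(0x00007f7400000000ULL)` falls between 6.8e-310 and 7.0e-310 and is flagged by `looks_like_raw_bits`, while ordinary values such as 0.0001 are not.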

@jewatkins
Collaborator Author

Maybe some incompatibility with @bartgol's custom dual view classes?

@bartgol
Collaborator

bartgol commented Nov 28, 2024

I will have to check next week.

@bartgol
Collaborator

bartgol commented Dec 2, 2024

I don't see anything wrong in the code. The only thing I can think of is that the dyn rank view may have the wrong size... But I don't see why that would happen. I need to dig a bit.

@mcarlson801
Collaborator

I dug a little deeper into LoadStateField and it looks like we can rule out sync issues with the state data dual view and issues with the MDField iterator. All three fields in LoadStateField agree; the data itself is just bad.

From albany-serial-bad (LandIce_FO_Dome_Ascii):

21: [DEBUG] LoadStateField(flow_factor)<Residual>
21: [DEBUG] dataVec = [3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 ]
21: [DEBUG] stateData(dev) = [3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 ]
21: [DEBUG] stateData(host) = [3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 3.17098e-24 ]

From albany-serial-good (LandIce_FO_Dome_Ascii):

21: [DEBUG] LoadStateField(flow_factor)<Residual>
21: [DEBUG] dataVec = [0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 ]
21: [DEBUG] stateData(dev) = [0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 ]
21: [DEBUG] stateData(host) = [0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 ]

@bartgol
Collaborator

bartgol commented Dec 2, 2024

Interesting. So the issue is that the field is probably never correctly loaded into the STK structures when the mesh is created?

mperego added a commit that referenced this issue Dec 4, 2024
Apparently linear access of Kokkos dynamic rank views is no longer working
mperego added a commit that referenced this issue Dec 4, 2024
Apparently linear access of Kokkos dynamic rank views is no longer working
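
The commit message above points at flat (single-index) access into dynamic rank views; the actual change in the referenced commits is not shown in this thread. As a generic illustration only, the sketch below uses a toy row-major `Array2D` (a stand-in, not `Kokkos::DynRankView`) and hypothetical helpers `unravel` and `copy_by_index` to show one way to fill a rank-2 array from a flat buffer using explicit (i, j) indices, so the loop does not depend on how a single index is interpreted.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Toy rank-2 array with row-major storage. Hypothetical stand-in used only
// for illustration; it is NOT the Kokkos API.
struct Array2D {
  std::size_t n0, n1;
  std::vector<double> storage;
  Array2D(std::size_t d0, std::size_t d1)
      : n0(d0), n1(d1), storage(d0 * d1, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return storage[i * n1 + j]; }
  double operator()(std::size_t i, std::size_t j) const { return storage[i * n1 + j]; }
  std::size_t size() const { return n0 * n1; }
};

// Split flat position k into (i, j) for a row-major array with n1 columns.
inline std::array<std::size_t, 2> unravel(std::size_t k, std::size_t n1) {
  return {k / n1, k % n1};
}

// Copy a flat buffer into dst using explicit two-index access only.
// Stops at whichever of the two containers is shorter.
inline void copy_by_index(const std::vector<double>& src, Array2D& dst) {
  for (std::size_t k = 0; k < dst.size() && k < src.size(); ++k) {
    const auto ij = unravel(k, dst.n1);
    dst(ij[0], ij[1]) = src[k];
  }
}
```

With a 2 x 3 array and the buffer {0, 1, 2, 3, 4, 5}, `copy_by_index` places 3 at (1, 0) and 5 at (1, 2). A column-major layout would need the roles of the quotient and remainder in `unravel` adjusted accordingly.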