Bump Parthenon and Kokkos #114
Conversation
Looks like I still need to update the new hst file name in the tests.
LGTM
@par-hermes format
I think I caught everything now and tests pass again.
The code looks fine, but the MPI regression test still appears to be failing.
tst/regression/test_suites/aniso_therm_cond_gauss_conv/aniso_therm_cond_gauss_conv.py
The MPI regression error messages are very odd, and they all happen for the cluster_magnetic_tower test:
OpenMPI errors:
Actual error that causes the regression to fail:
This is all so annoying...
So I went back to the CUDA 11.6 and Ubuntu 20.04 container and am now testing various combinations of scipy, h5py, and numpy, which by default have some incompatibilities due to the use of deprecated interfaces... Such a mess...
I've gotten OpenMPI 5 + UCX to work with CUDA-awareness outside of a container, so I'm a bit surprised by that combination. Does it work outside the container?
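For checking whether the OpenMPI inside the container is CUDA-aware at all, a standard `ompi_info` query (nothing specific to this setup) is:

```bash
# Prints "...mpi_built_with_cuda_support:value:true" if this OpenMPI build
# was compiled with CUDA-awareness.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```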
Leaving this for posterity -- something about the IPC is odd.
I was getting around a couple of errors like
by disabling CUDA IPC altogether:
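(The exact setting used above did not survive in this thread; the following is a sketch of how CUDA IPC is commonly disabled for OpenMPI and UCX -- whether this matches the combination used here is an assumption, and the executable name is a placeholder.)

```bash
# Disable CUDA IPC in OpenMPI's smcuda BTL for a single run
# (./your_app is a placeholder executable):
mpirun --mca btl_smcuda_use_cuda_ipc 0 -np 2 ./your_app

# The same setting as an environment variable, e.g. in a CI script:
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0

# If communication goes through UCX instead, the cuda_ipc transport can be
# excluded (assumes the installed UCX supports the "^" exclusion syntax):
export UCX_TLS=^cuda_ipc
```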
I cannot believe it. Victory!!! By the way, this unmerged doc (open-mpi/ompi#12137, https://github.com/open-mpi/ompi/blob/909168e501b7eb144d4a361a88938af99c1a4352/docs/tuning-apps/networking/cuda.rst) was quite helpful.
For future reference, OLCF has example CUDA-aware MPI Dockerfiles here: https://code.ornl.gov/olcfcontainers/olcfbaseimages/-/blob/master/summit/mpiimage-centos-cuda/Dockerfile?ref_type=heads. It looks like they download the pre-built UCX and OpenMPI from NVIDIA/Mellanox.
@pgrete @BenWibking I'm a little late to the party; it took LAMMPS running into the same issue with GPUDirect Cray MPICH and a new version of Kokkos to hit the same IPC errors. I just figured this one out yesterday. The core issue is that Kokkos recently made the stream-ordered allocator (`cudaMallocAsync`) the default for device allocations, and such allocations cannot be shared through the legacy CUDA IPC mechanism. To switch to the new IPC, Kokkos would need to create a CUDA memory pool separate from the default pool. One would then need to use that memory pool object to create a file descriptor to pass to the MPI framework/other process so it can access that memory. So the full solution would involve changes to both Kokkos and the comm libraries. For now you can disable IPC in the MPI library, or you can disable the async allocator in Kokkos at build time.
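Since the report above involves GPUDirect Cray MPICH: on Cray systems the analogous knob is an environment variable. The names below are my recollection of Cray MPICH's GPU-support settings and should be checked against `man intro_mpi` on the machine in question.

```bash
# Keep GPU-aware MPI enabled but turn off the on-node CUDA IPC path
# (variable names per Cray MPICH; verify locally).
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_IPC_ENABLED=0
```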
@fglines-nv I recently stumbled upon this issue with AthenaK on ALCF Polaris with Cray MPICH, and your links to the other issues led me here. Do you know when (i.e., in which version) Kokkos made this change? I think we started having issues at 4.2.00, see kokkos/kokkos#7294. We have been recompiling with the async allocator disabled as a workaround. The Kokkos team removed the comprehensive list of CMake build flags from their documentation.
@felker Supposedly it was this PR in Kokkos: kokkos/kokkos#6402. The flag should be the CMake option that controls the `cudaMallocAsync` allocator; see the sketch below.
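For reference, a minimal configure sketch of the Kokkos-side workaround. The option name `Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC` is my understanding of the switch added around kokkos/kokkos#6402; it is worth verifying against the Kokkos version actually in use.

```bash
# Configure with the stream-ordered cudaMallocAsync allocator turned off so
# device allocations fall back to plain cudaMalloc, which keeps the legacy
# CUDA IPC path in the MPI library working.
cmake -S . -B build \
  -DKokkos_ENABLE_CUDA=ON \
  -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
```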
Updates Parthenon to 24.08 and Kokkos to 4.4.0 (both released last month).
Changes to the interface are described in the Changelog.
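For anyone reproducing the bump locally, a rough sketch of the usual submodule update; the submodule paths and the Parthenon tag name are assumptions, not taken from this PR.

```bash
# Update the vendored dependencies to the new releases (paths and tags assumed).
cd external/parthenon && git fetch --tags && git checkout v24.08 && cd -
cd external/Kokkos && git fetch --tags && git checkout 4.4.00 && cd -
git add external/parthenon external/Kokkos
git commit -m "Bump Parthenon to 24.08 and Kokkos to 4.4.0"
```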