kokkos: snapshot from commit a5eb4d4e is causing our application to hang toward the end of the simulation #13351
Comments
@glhenni @vqd8a the 4.4 release included thread safety fixes that exposed some incorrect usages of Views in a couple of places in Trilinos, resulting in deadlocked/hanging tests. The most common cases were View creation/destruction within parallel regions, often with View-of-Views usage where creation and/or destruction was not properly handled. Based on your report and the hanging tests, I suspect something similar might be occurring.
@glhenni a new tool is in progress that was very helpful in finding the View usage issues in Trilinos: kokkos/kokkos-tools#267. I suggest running the hanging test with this tool to see if any culprit usage is flagged.
I think I am seeing this in Nalu.... However, my final bisect iteration does not actually build: commit f8ff2ad (HEAD)
[ 45%] Building CXX object packages/kokkos/containers/src/CMakeFiles/kokkoscontainers.dir/impl/Kokkos_UnorderedMap_impl.cpp.o |
@spdomin did the build failure occur with a clean build? If needed you can disable mdspan with the |
This build is part of a bisect to figure out the hang I am seeing. I configure Trilinos at each step. So, yes, I think this is a clean build.
I added:
Sorry, I am somewhat taking over this support ticket... I will post back if our new hang points to this commit, while using some of the advice given above.
@spdomin can you post your Trilinos configuration reproducer? We saw build failures like the one you posted above in Trilinos builds, and they were resolved by kokkos/kokkos#7103 (included in the 4.4 snapshot). We'll need to reproduce this and open an issue to figure out why that fix does not help in your configuration.
I use this: with:
The current build with the new MDSPAN=OFF is proceeding.
There you go :) `a5eb4d4e1436e5594ce73ffe62e1cb0f460c99b0 is the first bad commit`
I will review the notes above. Offhand, I do not know about this view-of-views pattern in Nalu...
@spdomin the tool can be used more generally, beyond View of Views, to detect allocations, deallocations, and fences within parallel regions (the naming was inspired by the first cases that showed up with this issue). If the hang is caused by something along these lines, then the tool will help by listing the potential culprit View(s).
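As a rough illustration of what such a check might look like, here is a standard-library sketch that tracks which parallel region is active and reports any allocation or deallocation event that arrives while one is open. All names (`RegionAllocationDetector`, `begin_parallel`, `on_allocate`, and so on) are hypothetical; this is not the implementation of kokkos/kokkos-tools#267 and does not use the Kokkos Tools API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical detector: illustrative names only, not a Kokkos Tools interface.
// It keeps a stack of open parallel regions and flags allocation events that
// occur while the stack is non-empty.
class RegionAllocationDetector {
 public:
  void begin_parallel(const std::string& kernel) { active_.push_back(kernel); }

  void end_parallel() {
    assert(!active_.empty() && "end_parallel without begin_parallel");
    if (!active_.empty()) active_.pop_back();
  }

  void on_allocate(const std::string& label) { record("allocate", label); }
  void on_deallocate(const std::string& label) { record("deallocate", label); }

  const std::vector<std::string>& reports() const { return reports_; }

 private:
  void record(const std::string& what, const std::string& label) {
    if (active_.empty()) return;  // outside any region: nothing to flag
    const std::string shown = label.empty() ? "<unlabeled>" : label;
    reports_.push_back(what + " of '" + shown + "' inside '" + active_.back() + "'");
  }

  std::vector<std::string> active_;   // currently open regions, innermost last
  std::vector<std::string> reports_;  // one entry per flagged event
};
```

In this sketch an allocation made before `begin_parallel` produces no report, while one made between `begin_parallel` and `end_parallel` is recorded with the enclosing kernel name. An allocation with an empty label is reported as `<unlabeled>`, which is one reason labeling Views makes such reports easier to act on.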
@spdomin so far I am not able to reproduce the compilation error you saw. I tested on solo, which had the closest match I could find to the modules you listed, and pared back some of the configuration script you pointed to. The error occurs in kokkos, so enabling netcdf and the packages that use it (seacas etc.) was not necessary to try to reproduce it. The error should occur when just building the kokkos library; I also enabled the kokkos tests for added coverage, but no luck. Here is what I tried on solo with sha 1eb0af7 (includes the snapshot sha listed above):
# environment
module load gnu/10.3.1 openmpi-gnu/4.1 cmake
export blas_install_lib=/usr/lib64/libblas.so.3
export lapack_install_lib=/usr/lib64/liblapack.so.3
# build dir
mkdir -p Build
cd Build
# configuration
export TRILINOS_DIR=<path-to-Trilinos>
cmake \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
-DTrilinos_ENABLE_CXX11=ON \
-DCMAKE_BUILD_TYPE=RELEASE \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-DTpetra_INST_DOUBLE:BOOL=ON \
-DTpetra_INST_INT_LONG:BOOL=ON \
-DTpetra_INST_INT_LONG_LONG:BOOL=OFF \
-DTpetra_INST_COMPLEX_DOUBLE=OFF \
-DTrilinos_ENABLE_TESTS:BOOL=OFF \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF \
-DTPL_ENABLE_MPI=ON \
-DTPL_ENABLE_SuperLU=OFF \
-DTPL_ENABLE_Boost:BOOL=OFF \
-DTrilinos_ENABLE_Epetra:BOOL=OFF \
-DTrilinos_ENABLE_Kokkos:BOOL=ON \
-DKokkos_ENABLE_TESTS:BOOL=ON \
-DTrilinos_ENABLE_Tpetra:BOOL=ON \
-DTrilinos_ENABLE_ML:BOOL=OFF \
-DTrilinos_ENABLE_MueLu:BOOL=ON \
-DTrilinos_ENABLE_Stratimikos:BOOL=OFF \
-DTrilinos_ENABLE_Thyra:BOOL=OFF \
-DTrilinos_ENABLE_EpetraExt:BOOL=OFF \
-DTrilinos_ENABLE_AztecOO:BOOL=OFF \
-DTrilinos_ENABLE_Belos:BOOL=ON \
-DTrilinos_ENABLE_Ifpack2:BOOL=ON \
-DTrilinos_ENABLE_Amesos2:BOOL=ON \
-DTrilinos_ENABLE_Zoltan2:BOOL=ON \
-DTrilinos_ENABLE_Ifpack:BOOL=OFF \
-DTrilinos_ENABLE_Amesos:BOOL=OFF \
-DTrilinos_ENABLE_Zoltan:BOOL=ON \
-DTrilinos_ENABLE_STKMesh:BOOL=ON \
-DTrilinos_ENABLE_STKSimd:BOOL=ON \
-DTrilinos_ENABLE_STKIO:BOOL=OFF \
-DTrilinos_ENABLE_STKTransfer:BOOL=ON \
-DTrilinos_ENABLE_STKSearch:BOOL=ON \
-DTrilinos_ENABLE_STKUtil:BOOL=ON \
-DTrilinos_ENABLE_STKTopology:BOOL=ON \
-DTrilinos_ENABLE_STKBalance:BOOL=OFF \
-DTrilinos_ENABLE_STKUnit_tests:BOOL=OFF \
-DTrilinos_ENABLE_STKUnit_test_utils:BOOL=OFF \
-DTrilinos_ENABLE_Gtest:BOOL=ON \
-DKokkos_ENABLE_ATOMICS_BYPASS=ON \
-DTPL_ENABLE_Netcdf:BOOL=OFF \
-DTPL_BLAS_LIBRARIES=${blas_install_lib} \
-DTPL_LAPACK_LIBRARIES=${lapack_install_lib} \
$EXTRA_ARGS \
$TRILINOS_DIR
# build kokkos and tests
cd packages/kokkos
make -j16
Would you be able to test the configuration above in a manual build on the machine where you see the issue?
@ndellingwood, Let's take the build error during my bisect finding offline, or add a new ticket so that this particular ticket can focus on apps using "views of views". It turns out, I was able to locate the offending code in the failing unit tests, @alanw0 may have more insight. It does not look like our core Nalu assembly has this issue. The first hang occurs at:
Hmm, that's not a view-of-views, but it is a view allocation which is probably happening within a Kokkos::parallel_for. It turns out that is not legal even in Kokkos::Serial. I can probably help fix this.
The T/F team pointed me to this: https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#can-i-make-a-view-of-views
Are you sure about this not being a view of a view? rhs_ is a Kokkos::View<double*>. Why do we not simply use this view itself?
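To illustrate the point above with a toy model: if dispatching a parallel region and allocating both need the same non-recursive lock, an allocation requested from inside the region can never proceed. This sketch uses only the standard library; `ToyExecSpace`, `try_allocate`, and the two helper functions are hypothetical names, and this is an assumption-laden analogy rather than a description of how Kokkos is implemented. It reports the blocked request instead of hanging, then shows the usual remedy of allocating before the region.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an execution space instance. One flag plays the
// role of a non-recursive lock shared by dispatch and allocation.
// Single-threaded toy: the loop in parallel_for runs serially.
struct ToyExecSpace {
  std::atomic<bool> busy{false};

  bool try_acquire() { return !busy.exchange(true); }
  void release() { busy.store(false); }

  // Returns false in the situation where a blocking lock would wait forever.
  bool try_allocate(std::vector<double>& out, std::size_t n) {
    if (!try_acquire()) return false;
    out.assign(n, 0.0);
    release();
    return true;
  }

  template <class Body>
  void parallel_for(std::size_t n, Body body) {
    const bool acquired = try_acquire();  // "lock" is held for the whole region
    assert(acquired && "nested parallel_for is not modeled");
    (void)acquired;
    for (std::size_t i = 0; i < n; ++i) body(i);
    release();
  }
};

// Problematic pattern: every iteration asks for an allocation inside the region.
int count_blocked_allocations(std::size_t n) {
  ToyExecSpace space;
  int blocked = 0;
  space.parallel_for(n, [&](std::size_t) {
    std::vector<double> tmp;
    if (!space.try_allocate(tmp, 8)) ++blocked;
  });
  return blocked;
}

// Remedy: allocate once before the region and only write to it inside.
std::vector<double> fill_with_hoisted_scratch(std::size_t n) {
  ToyExecSpace space;
  std::vector<double> scratch;
  const bool ok = space.try_allocate(scratch, n);
  assert(ok);
  (void)ok;
  space.parallel_for(n, [&](std::size_t i) {
    scratch[i] = 2.0 * static_cast<double>(i);
  });
  return scratch;
}
```

With these definitions, `count_blocked_allocations(4)` returns 4 because every in-region request is refused, while `fill_with_hoisted_scratch(4)` completes and yields the values 0, 2, 4, 6.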
I did manage to build and use the vov debugger library, but it's throwing an error at a location prior to the one causing the hang. I'm assuming that will have to be fixed as well. I'm behind the curve on this one because I'm not a Kokkos programmer. I'm acting as the intermediary, since the people with actual knowledge of Kokkos and GEMMA aren't on GitHub. Anyway, with
@spdomin echoing @alanw0 , the culprit may be a View construction called within a parallel_for (not a view-of-views) triggering an allocation in a parallel region which can deadlock. If the function in the code snip above is called within a parallel_*, that could be the issue. It looks like |
@ndellingwood, @alanw0 and I will look into the fix... I think we were being lax within the unit test matrix assembly procedure and should be able to resolve this quickly. Thank you for the v_of_v example; it helped my understanding. Again, apologies for doubling up on this ticket with the Nalu-specific issue. Best of luck with the GEMMA fix. I will certainly keep track to learn more about how others are using Kokkos in apps.
@spdomin let me know how it goes, either on ticket or offline. In case useful, another thought came to mind was if you can decouple the
call |
@glhenni excellent, thanks for posting the output, this line:
points to the parallel_* call where a deallocation of a View is attempted, though the View isn't labeled so it will take a bit of checking. I'll contact you offline to see how best I can try to help more.
Bug Report
@crtrott it seems that commit a5eb4d4 is causing our application, GEMMA, to hang in the latter portions of the simulation. All we have to offer for diagnosing the problem so far is the stack trace below, obtained from interrupting the code while in the debugger and run through c++filt. Any suggestions on how to find the problem?