Nightly Trilinos failure with Cuda/11.2.2 non-UVM builds, MueLu, Panzer unit tests #2015

Closed
ndellingwood opened this issue Oct 24, 2023 · 7 comments

Comments

@ndellingwood
Contributor

ndellingwood commented Oct 24, 2023

Nightly cuda/11.2.2 builds (no UVM) are failing in the following unit tests with kokkos-kernels@develop:

03:05:58 The following tests FAILED:
03:05:58 	1784 - MueLu_UnitTestsBlockedTpetra_MPI_1 (Failed)
03:05:58 	1785 - MueLu_UnitTestsBlockedTpetra_MPI_4 (Failed)
03:05:58 	1838 - MueLu_MeshTyingBlocked_SimpleSmoother_MPI_4 (Failed)
03:05:58 	1841 - MueLu_MeshTyingBlocked_SimpleSmoother_2dof_medium_MPI_4 (Failed)
03:05:58 	2286 - PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4 (Failed)
03:05:58 	2371 - PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_reuse_MPI_4 (Failed)
03:05:58 	2372 - PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell2D_MPI_4 (Failed)
03:05:58 	2373 - PanzerMiniEM_MiniEM-BlockPrec_MueLu_highOrder_0_MPI_4 (Failed)

https://jenkins-son.sandia.gov/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/257

The PanzerMiniEM_MiniEM-BlockPrec_MueLu_highOrder_0_MPI_4 failure was previously reported in #2010 and also occurs with release-candidate-4.2.00. The other tests began failing after the merge of the following commits:

Sparse: fix cusparse spgemm hang properly (detail)
Sparse: fix logic for bad cursparse spgemm version. (detail)
Improvements on the unification attempt logic for axpby(), including new tests (detail)
Addressing feedbacks from Luc, plus some small changes here and there: (detail)
Formatting (detail)
Using 'ifdef HAVE_KOKKOSKERNELS_DEBUG', per Luc's suggestion (detail)
Addressing feedbacks from Luc (detail)
Correcting compilation errors in my Mac (detail)
Backup (detail)
CUDA 11.0.1 / cuSPARSE 11.0.0 changed SpMM enums (detail)
CUDA 11.2.1 / cuSPARSE 11.4.0 changed SpMV (detail)

Reproducer (weaver rhel8):

# Repos
git clone -b kokkos-promotion https://github.com/trilinos/Trilinos.git
git clone -b develop https://github.com/kokkos/kokkos.git
git clone -b develop https://github.com/kokkos/kokkos-kernels.git

# Symbolic link to external kokkos and kokkos-kernels repos in Trilinos source directory for source override
cd Trilinos
ln -s <path-to-your-repo>/kokkos kokkos
ln -s <path-to-your-repo>/kokkos-kernels kokkos-kernels
cd ..

# Create build and local tmp directories
mkdir -p build
cd build

export TEMPDIR=$PWD/tmp_cuda
export TMPDIR=$TEMPDIR
mkdir -p $TMPDIR

# Interactive node
bsub -Is -n 1 -q rhel8 -gpu "num=4" bash

# Environment setup
source /etc/profile.d/modules.sh
module purge
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh

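# Note: TRILINOS_DIR and KOKKOS_DIR are not set by the steps above; they are assumed
# to already point to the Trilinos and kokkos clones, respectively
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_DIR/bin/nvcc_wrapper"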

# Cmake config
cmake \
      -D CMAKE_CXX_FLAGS='-g' \
      -D CMAKE_CXX_STANDARD="17" \
      -D CMAKE_INSTALL_PREFIX=$PWD/install \
      -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
      -DTrilinos_ENABLE_TESTS=OFF \
      -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
      -D Trilinos_ENABLE_Kokkos=ON \
      -D Kokkos_ARCH_VOLTA70=ON \
      -D Kokkos_ARCH_POWER9=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
      -D Kokkos_ENABLE_CUDA_UVM=OFF \
      -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF \
      -DTrilinos_ENABLE_Stokhos=ON \
      -D TPL_ENABLE_CUSPARSE:BOOL=ON \
      -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
      -DTrilinos_ENABLE_MueLu=ON \
      -D MueLu_ENABLE_TESTS=ON \
      -DTrilinos_ENABLE_Panzer=ON \
      -D Panzer_ENABLE_TESTS=ON \
      -D Panzer_ENABLE_EXAMPLES=ON \
      -DKokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \
      -DKokkosKernels_SOURCE_DIR_OVERRIDE:STRING=kokkos-kernels \
      $TRILINOS_DIR

# Build
make -j16

# Run the tests to reproduce the failures
ctest
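
A full `ctest` run in this configuration executes every enabled MueLu and Panzer test. As a rough sketch, the run can be narrowed to the failing tests with `ctest -R`, using a regular expression built from the test names in the nightly log above. The variable name `FAILING_TESTS` is only illustrative, and the pattern is deliberately loose, so it may also select neighboring tests with the same prefixes:

```shell
# Regex built from the failing test names in the nightly log (illustrative, loose match).
FAILING_TESTS='MueLu_UnitTestsBlockedTpetra|MueLu_MeshTyingBlocked_SimpleSmoother|PanzerAdaptersSTK_tDOFManager2_SimpleTests|PanzerMiniEM_MiniEM-BlockPrec'

# Only invoke ctest from inside a configured build tree.
if [ -f CTestTestfile.cmake ]; then
  ctest -R "$FAILING_TESTS" --output-on-failure
else
  echo "No CTestTestfile.cmake here; run this from the build directory."
fi
```

`--output-on-failure` prints the output of the tests that fail, which helps when comparing runs before and after a revert.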
@lucbv
Contributor

lucbv commented Oct 24, 2023

Blocked seems to be a theme in the failing unit-tests but I'm not sure these are the small blocks of a BsrMatrix.

@ndellingwood
Contributor Author

ndellingwood commented Oct 24, 2023

PRs corresponding to the commit list:

@lucbv since the failures are block-related, I'll start triage by reverting #2008 to see the impact on the tests. MueLu builds take a while with CUDA, so it will be a while before I have the breaking change pinpointed.
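
The revert-and-retest triage described here can be sketched roughly as follows. This assumes each PR landed in kokkos-kernels as a merge commit; the helper name `revert_pr_merge` is hypothetical, and the merge commit SHAs are not given in this thread, so they would have to be looked up (for example with `git log --merges --oneline`):

```shell
# Hypothetical helper: revert one PR's merge commit in a local checkout.
#   $1 = path to the repository (e.g. the kokkos-kernels clone)
#   $2 = SHA of the PR's merge commit (assumed; look it up in the git log)
revert_pr_merge() {
  # -m 1 keeps the first parent (the target branch) as the mainline,
  # so the revert undoes only the changes the PR brought in.
  git -C "$1" revert --no-edit -m 1 "$2"
}
```

After calling the helper on the kokkos-kernels clone, rebuilding and rerunning the failing tests shows whether that PR introduced the failures.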

@ndellingwood
Contributor Author

ndellingwood commented Oct 24, 2023

@lucbv reverting #2008 did not resolve the MueLu failures. Rebuilding with a revert of #2011 to retest.

@lucbv
Contributor

lucbv commented Oct 24, 2023

Okay, I'm glad #2008 did not cause the failures, but unfortunately that means it will take you a little longer to get to the bottom of this.
Except for #1895, the other 3 PRs are fairly light in terms of changes, so if one of them triggers the problem it should still be easy to fix.

@ndellingwood
Contributor Author

Reverting #2011 and #2012 did not help with the MueLu tests; they still failed. Rebuilding with a revert of #1895.

@ndellingwood
Contributor Author

Reverting #1895 returned the MueLu tests to passing.

@ndellingwood
Contributor Author

Addressed by #2039, thanks @eeprude!
