meeting GPU support (2024 05 27)
Present: Bob, Jurij, Davide, Thomas, Lara, Richard, Pedro, Kenneth, Julián
- "same software everywhere" promise is not feasible anymore when accelerators come into view
- combinatorics issue: can't really provide optimized installations for all CPU targets + all generations of NVIDIA GPUs
- see CUDA compute capabilities: https://developer.nvidia.com/cuda-gpus
- think CUDA + RISC-V
- could we use dev.eessi.io?
- benefits: we understand better how to support developers; we can be more flexible with adding packages (& removing them); probably good to adjust workflow + develop what is needed with the bot ...
- For development, sure, but we would still need a deployment repo
- does a separate repo make things more complicated for the end users?
- Kenneth: yes, may hurt adoption/discovery
- do we actually have a "same software everywhere" guarantee on the main repo?
- Can we craft something to allow us to keep that for accelerators?
- Now maybe, but what about RISC-V?
- fat builds vs compute capability specific builds
- directory structure for this
- currently we have `EESSI_SOFTWARE_SUBDIR/software` and `EESSI_SOFTWARE_SUBDIR/modules/all`
- NEW for GPU-enhanced software
  - `EESSI_SOFTWARE_SUBDIR/nvidia_software/compute_XY` (binaries may still be built for specific CPU families + microarchitectures & are not fat)
  - path could be shortened to `.../nvidia/cc_XY`
- NEW for GPU-enhanced modules
  - `EESSI_SOFTWARE_SUBDIR/nvidia_modules/compute_XY/modules/all` (allows us to detect the compute capability and then use the best-fitting available set of modules)
- how "complete" should the coverage be?
- starting from compute_60 (P100) or compute_70 (V100) would support many GPUs (including consumer RTX cards)
- but this is less than what is actually covered by CUDA 12
- what about the combinatorics? you also need CPU coverage
- PROPOSAL
  - full compute capability support for `generic` (untested)
  - selected CPU and CC combinations, based on ability to find test locations
  - need to cover specific combinations found in EuroHPC for production GPUs
- fat builds could be placed in the standard software/module directories (and "shadowed" by those that are arch-specific)
- could the downside of fat builds be that they don't work so well with CVMFS, i.e., (much) larger binaries have to be fetched and cached?
- not all software supports fat builds
- unnecessary duplication
- applications appear on architectures where they are not supported
- fat builds are not possible for all applications
- interesting overview https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
- you compile for a given arch and can also embed support for multiple compute capabilities (see the nvcc sketch below)
- only single arch supported without JIT compilation
- big fat binaries
- More on understanding CUDA fat binaries & JIT https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/
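To make the `-arch`/`-gencode` point above concrete, a minimal sketch of building a fat binary for several compute capabilities (the source file name and the exact set of targets are placeholders):

```bash
# Embed SASS for several real architectures (sm_60/70/80) in one binary,
# plus PTX for compute_80 so newer GPUs can still JIT-compile at load time.
nvcc -O2 example.cu -o example \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_80,code=compute_80
```

Each additional `-gencode` entry grows the binary, which is the "big fat binaries" downside noted above.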
- start with GROMACS as initial GPU app?
- Work has already been done for PyTorch (by Thomas)
- AlphaFold next perhaps (requires TensorFlow)
- Both unlikely to throw up issues, but CUDA-aware MPI is only supported via UCX
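To check the CUDA-aware MPI point above for a given Open MPI installation, a quick sketch (assumes the `ompi_info` of that installation is on the `PATH`):

```bash
# Reports whether this Open MPI build was compiled with CUDA support;
# the value field at the end of the matching line should read "true".
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```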
- initial support for ROCm
- See https://github.com/easybuilders/easybuild-easyconfigs/pull/19591
- PR for 5.6.0 with GCC 11.3
- Current release is 6.1.1
- No ROCm toolchain defined
- Toolchain is arch-specific (and currently we don't even have a Clang toolchain)
- need to figure out device support (see the sketch below)
- could be much simpler than CUDA, as you actually build the drivers and make our linker aware of them
- not clear how updates would be handled
- contacts within AMD via Kenneth (Joe Landman, George Markomanolis, ...)
- Relevant paper -
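For the "figure out device support" item in the ROCm discussion above, a minimal detection sketch, assuming a ROCm installation that provides `rocm_agent_enumerator`:

```bash
# List the LLVM gfx targets of the GPUs in this node (e.g. gfx90a on MI250X);
# the gfx000 entry corresponds to the host CPU agent and is filtered out.
rocm_agent_enumerator | grep -v '^gfx000$' | sort -u
```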
- OpenMPI vs MPICH (vs MVAPICH2 ... but this has same ABI as MPICH)
- WI4MPI?
- Used in containers for E4S
- Only currently supports `MPIX_Query_cuda_support`, not `MPIX_Query_hip_support` or `MPIX_Query_ze_support` (both MPICH) or `MPIX_Query_rocm_support` (OpenMPI)
- Unsure how easy it is to replace MPI libraries
- MPItrampoline is a good model for our use case
- Allows us to ship an arbitrary spectrum of MPI builds
- Everything builds against `MPItrampoline`, which includes a default MPI backend (e.g. OpenMPI without GPU). Additional backends are shipped via `MPIwrapper` and are enabled by setting an environment variable, making it really easy for a site to inject their preferred MPI library via Lmod (see the sketch below).
- Not perfect: works fine for C++ and C, but not all Fortran codes (see https://github.com/eschnett/MPItrampoline/issues/27)
- Open PR for proposed MPI ABI (which should also help solve Fortran issues) https://github.com/eschnett/MPItrampoline/pull/43
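A minimal sketch of the MPItrampoline mechanism described above; the environment variable name comes from MPItrampoline's documentation, while the paths and application name are purely illustrative:

```bash
# Point MPItrampoline at a site-provided MPIwrapper build of the preferred
# (e.g. GPU-aware) MPI library; a site could set this from an Lmod module.
export MPITRAMPOLINE_LIB=/opt/site/mpiwrapper/lib/libmpiwrapper.so

# The application was linked against MPItrampoline only; at run time all
# MPI calls are dispatched to the wrapped library selected above.
srun -n 4 ./my_app
```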
- which CPU+GPU targets should we build for + directory structure
- generic GPU: lowest common denominator (LCD) like CUDA CC 6.0
- examples:
  - `software/x86_64/generic/accel/nvidia/cc60` (generic CPU + NVIDIA P100 or newer)
  - `software/x86_64/amd/zen2/accel/nvidia/cc80` (Vega: AMD Rome + A100 - compute capability 8.0)
    - `software/x86_64/amd/zen2/{software,modules/all}` # CPU
    - `software/x86_64/amd/zen2/accel/nvidia/cc80/{software,modules/all}` # A100
  - `software/x86_64/amd/zen3/accel/amd/gfx90a` (LUMI: AMD Milan + MI250X - LLVM target `gfx90a`)
  - `software/x86_64/amd/zen3/accel/intel/xxx` (example Intel Xe system)
  - `software/aarch64/a64fx/accel/amd/gfx90a` (Deucalion: A64FX + A100)
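A minimal sketch of how the `accel/nvidia/ccXY` layer from the examples above could be selected at initialisation time; `EESSI_SOFTWARE_PATH` is assumed to already point at the CPU-specific prefix, and the `compute_cap` query requires a reasonably recent NVIDIA driver:

```bash
# Detect the compute capability of the first GPU (e.g. "8.0" on an A100) ...
cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)

# ... and layer the matching accelerator module tree (e.g. accel/nvidia/cc80)
# on top of the CPU-only one.
module use "${EESSI_SOFTWARE_PATH}/accel/nvidia/cc${cc/./}/modules/all"
```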
- fat builds
- alongside CC-specific builds?
- complicates structure
- JSC is not in favor of fat builds (comes with a cost)
- doesn't work for all software (like LAMMPS)
- should keep system architectures of EuroHPC systems in mind
- example: Vega: AMD Rome (zen2) + NVIDIA A100
- we should be a bit more careful when making changes to scripts, like the container script
- "move fast and break things" vs being very careful not to break anything
- go with "reasonable effort" to not break things, deal with fallout when we do break things
- more CI for the `eessi_container.sh` script
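A minimal sketch of what extra CI for the container script could check; the `--help` invocation is an assumption about the script's interface, and the checks are only illustrative:

```bash
# Static checks: catch syntax errors and common shell pitfalls early.
bash -n eessi_container.sh
shellcheck eessi_container.sh

# Smoke test, assuming the script supports --help: it should at least
# start up and print usage information without errors.
./eessi_container.sh --help
```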