
Discussion on GPU support (20240527)

Present: Bob, Jurij, Davide, Thomas, Lara, Richard, Pedro, Kenneth, Julián

In software.eessi.io or separate repo?

  • "same software everywhere" promise is not feasible anymore when accelerators come into view
    • combinatorics issue: can't really provide optimized installations for all CPU targets + all generations of NVIDIA GPUs
    • think CUDA + RISC-V
  • could we use dev.eessi.io?
    • benefits: we understand better how to support developers; we can be more flexible with adding packages (& removing them); probably good to adjust workflow + develop what is needed with the bot ...
    • For development, sure, but we would still need a deployment repo
  • does a separate repo make things more complicated for the end users?
    • Kenneth: yes, may hurt adoption/discovery
  • do we actually have a "same software everywhere" guarantee on the main repo?
    • Can we craft something to allow us to keep that for accelerators?
      • Now maybe, but what about RISC-V?

NVIDIA GPUs

  • fat builds vs compute capability specific builds
    • directory structure for this
      • currently we have
        EESSI_SOFTWARE_SUBDIR/software
        EESSI_SOFTWARE_SUBDIR/modules/all
        
      • NEW for GPU-enhanced software
        • EESSI_SOFTWARE_SUBDIR/nvidia_software/compute_XY (binaries may still be built for specific CPU families+microarchitectures & are not fat)
          • Path could be shortened .../nvidia/cc_XY
      • NEW for GPU-enhanced modules
        • EESSI_SOFTWARE_SUBDIR/nvidia_modules/compute_XY/modules/all (allows us to detect the compute capability and then use the best fitting available set of modules; see the path-selection sketch after this list)
    • how "complete" should the coverage be?
      • starting from compute_60 (P100) or compute_70 (V100) would support many GPUs (including consumer RTX cards)
        • but this is less than what is actually covered by CUDA 12
      • what about the combinatorics? You also need CPU coverage
        • PROPOSAL
          • full compute capability support for generic (untested)
          • selected CPU and CC combinations based on ability to find test locations
            • Need to cover specific combinations found in EuroHPC for production GPUs
    • fat builds could be placed in the standard software/module directories (and "shadowed" by those that are arch-specific)
      • could the downside of fat builds be that they don't work so well with CVMFS, i.e., (much) large(r) binaries have to be fetched and cached?
      • not all software supports fat builds
      • unnecessary duplication
      • applications appear on architectures where they are not supported
    • fat builds are not possible for all applications
    • interesting overview https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
  • start with GROMACS as initial GPU app?
    • Work has already been done for PyTorch (by Thomas)
    • AlphaFold next perhaps (requires TensorFlow)
    • Both are unlikely to throw up issues, but CUDA-aware MPI is only supported via UCX
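
A minimal sketch (Python, purely illustrative) of the compute-capability-based module selection discussed above: detect the compute capability of the local GPU and pick the best fitting accel/nvidia/ccXY module tree under a CPU-specific prefix, falling back to the CPU-only modules. The directory layout, the nvidia-smi query and the "best fit" rule are assumptions rather than a settled design; real selection logic would also have to respect CUDA's binary-compatibility rules (SASS only runs on the same major compute capability unless PTX is embedded).

    # Sketch only: paths, environment and selection rule are illustrative.
    import os
    import subprocess

    def detect_compute_capability():
        """Return the compute capability of the first GPU as an int (e.g. 80), or None."""
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            ).stdout
            major, minor = out.splitlines()[0].strip().split(".")
            return int(major) * 10 + int(minor)
        except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
            return None  # no GPU, no driver, or a driver too old for this query

    def pick_module_tree(cpu_prefix):
        """Pick the best available accel/nvidia/ccXY module tree, or fall back to CPU-only."""
        cpu_only = os.path.join(cpu_prefix, "modules", "all")
        cc = detect_compute_capability()
        accel_root = os.path.join(cpu_prefix, "accel", "nvidia")
        if cc is None or not os.path.isdir(accel_root):
            return cpu_only
        available = [int(d[2:]) for d in os.listdir(accel_root)
                     if d.startswith("cc") and d[2:].isdigit()]
        # Simplified "best fit": the highest available CC not exceeding the detected one.
        candidates = [x for x in available if x <= cc]
        if not candidates:
            return cpu_only
        return os.path.join(accel_root, f"cc{max(candidates)}", "modules", "all")

    if __name__ == "__main__":
        # Hypothetical CPU-specific prefix (cf. the Vega example in the Notes section below):
        prefix = "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2"
        print(pick_module_tree(prefix))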

AMD GPUs

  • initial support for ROCm
  • need to figure out device support
    • could be much simpler than CUDA, since you actually build the drivers and make our linker aware of them
      • not clear how updates would be handled
  • contacts within AMD via Kenneth (Joe Landman, George Markomanolis, ...)

Accelerator-aware MPI

  • Relevant paper -
  • OpenMPI vs MPICH (vs MVAPICH2 ... but this has same ABI as MPICH)
  • Wi4MPI?
    • Used in containers for E4S
    • Currently only supports MPIX_Query_cuda_support, not MPIX_Query_hip_support or MPIX_Query_ze_support (both MPICH) or MPIX_Query_rocm_support (OpenMPI)
    • Unsure how easy it is to replace MPI libraries
  • MPItrampoline is a good model for our use case
    • Allows us to ship an arbitrary spectrum of MPI builds
    • Everything builds against MPItrampoline, which includes a default MPI backend (e.g. OpenMPI without GPU support). Additional backends are shipped via MPIwrapper and are enabled by setting an environment variable (making it really easy for a site to inject their preferred MPI library via Lmod).
    • Not perfect: works fine for C and C++, but not for all Fortran codes (see https://github.com/eschnett/MPItrampoline/issues/27)
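
A small illustration (Python, with hypothetical paths) of the MPItrampoline model described above: the application is linked against MPItrampoline, and a site injects its own MPI by pointing MPITRAMPOLINE_LIB (the environment variable MPItrampoline uses to locate the wrapped MPI) at an MPIwrapper-built libmpiwrapper.so. In practice a site would set this via an Lmod module rather than a wrapper script; this sketch just makes the run-time switch explicit.

    # Sketch only: the wrapper-library path and launcher invocation are illustrative.
    import os
    import shutil
    import subprocess
    import sys

    def run_with_site_mpi(app, app_args, wrapper_lib=None, nprocs=4):
        """Run an MPItrampoline-linked application, optionally injecting a site MPI backend."""
        env = os.environ.copy()
        if wrapper_lib is not None:
            # Point MPItrampoline at the site's MPIwrapper build (e.g. a CUDA-aware MPI);
            # without this, the default backend shipped with the build is used.
            env["MPITRAMPOLINE_LIB"] = wrapper_lib
        launcher = shutil.which("mpiexec") or "mpiexec"
        return subprocess.call([launcher, "-n", str(nprocs), app] + app_args, env=env)

    if __name__ == "__main__":
        if len(sys.argv) < 2:
            sys.exit("usage: run_with_site_mpi.py <app> [args...]")
        # Hypothetical site-specific MPIwrapper build against a CUDA-aware MPI:
        site_wrapper = "/opt/site/mpiwrapper-cuda/lib/libmpiwrapper.so"
        sys.exit(run_with_site_mpi(sys.argv[1], sys.argv[2:], wrapper_lib=site_wrapper))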

Notes

  • which CPU+GPU targets should we build for + directory structure
    • generic GPU: lowest common denominator (LCD) like CUDA CC 6.0
    • examples:
      • software/x86_64/generic/accel/nvidia/cc60 (generic CPU + NVIDIA P100 or newer)
      • software/x86_64/amd/zen2/accel/nvidia/cc80 (Vega: AMD Rome + A100, compute capability 8.0)
        • software/x86_64/amd/zen2/{software,modules/all} # CPU
        • software/x86_64/amd/zen2/accel/nvidia/cc80/{software,modules/all} # A100
      • software/x86_64/amd/zen3/accel/amd/gfx90a (LUMI: AMD Milan + MI250X - LLVM target gfx90a)
      • software/x86_64/amd/zen3/accel/intel/xxx (example Intel XE system)
      • software/aarch64/a64fx/accel/nvidia/cc80 (Deucalion: A64FX + A100)
  • fat builds (see the nvcc flag sketch after this list)
    • alongside CC-specific builds?
    • complicates structure
    • JSC is not in favor of fat builds (comes with a cost)
    • doesn't work for all software (like LAMMPS)
  • should keep system architectures of EuroHPC systems in mind
    • example: Vega: AMD Rome (zen2) + NVIDIA A100
  • we should be a bit more careful when making changes to scripts, like the container script
    • "move fast and break things" vs being very careful not to break anything
    • go with "reasonable effort" to not break things, deal with fallout when we do break things
    • more CI for eessi_container.sh script
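
To make the fat-build trade-off above more concrete: a fat build embeds device code for several compute capabilities in a single binary (hence the size and caching concerns), while a CC-specific build targets exactly one. A rough Python sketch that generates the corresponding nvcc -gencode flags; the compute capability lists are only examples (see the gencode overview linked in the NVIDIA section above).

    # Sketch only: the compute capability lists are examples.

    def gencode_flags(compute_capabilities, embed_ptx=True):
        """Return nvcc -gencode flags embedding SASS for each listed compute capability."""
        flags = []
        for cc in compute_capabilities:
            flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
        if embed_ptx and compute_capabilities:
            # Also embed PTX for the newest listed CC so future GPUs can JIT-compile it.
            newest = max(compute_capabilities)
            flags += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
        return flags

    # Fat build: one binary carrying device code for several GPU generations (larger binary).
    print(" ".join(gencode_flags([60, 70, 80, 90])))

    # CC-specific build: only what one accel/nvidia/ccXY prefix needs.
    print(" ".join(gencode_flags([80], embed_ptx=False)))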