EESSI hackathon Dec'21

EESSI hackathon (Dec'21)

when: week of Dec 13-17 2021
main goal: focused effort on various tasks in EESSI
expectations:
- joining kickoff/sync/show&tell meetings
- spending a couple of hours that week on one or more of the outlined tasks (in group)
- take extensive notes (to integrate into documentation later)
registration: https://doodle.com/poll/xha7h6pawwuk5xc2

Links

original list of potential tasks
GitHub repo for EESSI hackathon(s): https://github.com/EESSI/hackathons

Meetings

Mon Dec 13th 2021, 09:00 UTC: kickoff
- clarify expectations
- overview of tasks
- getting organised: who works on what, form groups
Wed Dec 15th 2021, 09:00 UTC: sync
- sync meeting notes:
- quick progress report per group
- briefly discuss next steps
- notes: (see below)
Fri Dec 17th 2021, 13:00 UTC: show & tell
- each group briefly demos/presents what they worked on
- outline follow-up steps
- slides: https://raw.githubusercontent.com/EESSI/meetings/main/meetings/EESSI_hackathon_2021-12_show_and_tell.pdf
- recording: https://www.youtube.com/watch?v=H6Wx6hAO-r0

Attendees

If you plan to actively participate in this hackathon:

add your name + affiliation + GitHub handle below (or ask someone to do it for you)
feel free to pick ONE task you would like to work on, add your name to the list for that task (see people working on this)

Joining:

Kenneth Hoste (HPC-UGent) - @boegel
Thomas Röblitz (HPC-UBergen) - @trz42
Bob Dröge (HPC-UGroningen) - @bedroge
Ward Poelmans (VUB-HPC) - @wpoely86
Jurij Pečar (EMBL) - @jpecar
Martin Errenst (HPC.NRW / University of Wuppertal) - @stderr-enst
Axel Rosén (HPC-UOslo) - @rungitta
Terje Kvernes (UOslo) - @terjekv
Alan O'Cais (CECAM) - @ocaisa
Caspar van Leeuwen (SURF) - @casparvl
Ahmad Hesam (SURF)
Michael Hübner (UBonn - HPC.NRW)
Erica Bianco (HPCNow!) - @kErica
Hugo Meiland (Microsoft Azure) - @hmeiland
Jörg Saßmannshausen (NHS/GSTT) - @sassy-crick

Available infrastructure

Please use the virtual clusters we have set up for this hackathon!

EESSI pilot repository is readily available
Different CPU types supported
Singularity is installed

Magic Castle cluster in AWS

managed by Alan
all info at https://github.com/EESSI/hackathons/tree/main/2021-12/magic_castle

Cluster-in-the-Cloud in AWS

managed by Kenneth
all info at https://github.com/EESSI/hackathons/tree/main/2021-12/citc

Communication

If you need help, contact us via the EESSI Slack (join via https://www.eessi-hpc.org/join)

General hackathon channel: #hackathon.

See also task-specific channels!

Selected tasks & task teams

Based on the doodle, a subset of main tasks was selected for this hackathon:

[02] Installing software on top of EESSI
- task lead: Kenneth
- participating: Kenneth, Erica, Martin, (Ahmad)
- notes: (see below)
- Slack channel: #hackathon-software_on_top
- Zoom: https://uib.zoom.us/j/65344277321?pwd=THVPY3hZQmlRa0loOWd6b2xKaFRrZz09
[03] Workflow to propose additions to EESSI software stack
- task lead: Bob
- participating: Bob, Jörg (+ Kenneth)
- notes: (see below)
- Slack channel: #hackathon-contribution_workflow
- Zoom: https://uib.zoom.us/j/69823235860?pwd=UjRNYmV0UGoxSmdGMkZsclpBSGJZQT09
[05] GPU support
- task lead: Alan
- participating: Alan, Michael, Ward
- notes: (see below)
- Slack channel: #hackathon-gpu_support
- Zoom: https://uib.zoom.us/j/69890745932?pwd=bWlxV2prTyswS0Q4SWptMzA3bDVBQT09
[06] EESSI test suite
- task lead: Caspar
- participating: Caspar, Vasileois, Thomas, Hugo, (Bob)
- notes: (see below)
- Slack channel: #hackathon-test_suite
- Zoom: https://uib.zoom.us/j/63178835002?pwd=SnUzTmFpcmlhS0VueWRwM2RicGtBdz09
[16] Export a version of the EESSI stack to a tarball and/or container image
- task lead: Jure
- participating: Jure
- notes: (see below)
- Slack channel: #hackathon-export_software_stack

Lone wolves:

Axel + Terje: monitoring
- notes: (see below)
- Zoom: https://uib.zoom.us/j/61135526605?pwd=VkZnRXhMVTI1RkIxTis2Vm4yUkRtQT09
Ahmad + Axel: private Stratum-1
- notes: (see below)
- Zoom: https://uib.zoom.us/j/62100150823?pwd=OFNpWk9RZ3llTWdqZ0VId1VUMG03UT09
Hugo (+ Matt): Azure support in CitC

Sync meeting notes (2021-12-15)

Task progress:

[02] Installing software on top of EESSI

task notes: (see below)
executive summary:
- great progress by Martin on including RPATH wrappers into GCCcore installation in EESSI to facilitate building software manually on top of EESSI
- Hugo got WRF installed on top of EESSI using EasyBuild
- TODO:
  - Figure out best way to add support to GCC easyblock to opt-in to also installing RPATH wrappers
  - Check on interference between included RPATH wrappers and the dynamic ones set up by EasyBuild
  - Documentation on installing software on top of EESSI
  - Fully autonomous build script (in Prefix env + build container, etc.)
    - use stdin trick to run stuff in Prefix env

/.../startprefix <<<
...

[03] Workflow to propose additions to EESSI software stack

task notes: (see below)
executive summary:
- initial planning + implementation done
- GitHub App/bot is being developed in https://github.com/EESSI/eessi-bot-software-layer
- bot can already react to opening of PR
- support was added to replay events (to facilitate testing)
- Jörg's build container/script can be used in "backend" of bot
- test PR: https://github.com/EESSI/hackathons/pull/2/files
- Is the bot used by CVMFS public?
  - cfr. https://github.com/cvmfs/cvmfs/pull/2829
- How will distributed resources be used?
  - One bot that talks to other resources
  - Multiple bots negotiating
- Bot should report back results of build/test (especially in terms of failure)

[05] GPU support

task notes: (see below)
executive summary:
- ...
- would be useful to have more recent toolchains installed in 2021.12 (for AMD Rome)

[06] EESSI test suite

task notes: (see below)
executive summary:
- ReFrame intro by Vasileios
- Created list of compat & software layer tests needed (see task notes)
- Setup ReFrame 3.9.2 (on top of EESSI! :) ) on Magic Castle
- Made GROMACS EESSI test (written on top of CSCS GROMACS libtest) work on Magic Castle
  - Only works from Jupyter terminal because of SELinux, but should be resolved in the future (see https://github.com/ComputeCanada/puppet-magic_castle/issues/163)
  - Default mem on Magic Castle jobs limited => Need to add memory request / requirements to GROMACS EESSI test
- Looking into compat tests (https://github.com/EESSI/compatibility-layer/blob/main/test/compat_layer.py)
  - so far mostly trying to understand ReFrame, rerunning tests, ...
- Started WRF test, working on download benchmark and prepping rundir
- separate CVMFS repo in EESSI for shipping large data files (benchmark inputs)?
- ReFrame: don't copy large files?
  - https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html?highlight=resourcesdir#reframe.core.systems.System.resourcesdir
- CVMFS: dealing with large files?

[16] Export a version of the EESSI stack to a tarball and/or container image

task notes: (see below)
executive summary:
- see detailed notes
- copy takes time, needed variant symlinks not yet in place
- alternative approach could be a separate archive repo + docs to use it

[XX] Monitoring

task notes: (see below)
executive summary:
- see notes
- Hugo: CVMFS through Azure blob will make monitoring more challenging
  - also applies to updating the repo contents

[XX] Private Stratum-1

task notes: (see below)
executive summary:
- see notes
- most work done by Ahmad, testing by Alex
- test suite to verify Stratum-1? => https://github.com/EESSI/filesystem-layer/issues/111

[XX] Azure support in CitC

task notes: ???
executive summary:
- WIP
credit status in AWS:
- ~~$110 on Mon (~~$25 GPU node, ~$45 EFA nodes in Magic Castle)
- ~~$130 on Tue (~~$40 GPU node, ~$45 EFA nodes in Magic Castle, ~$20 CitC nodes)

EESSI hackathon Dec'21 - installing software on top of EESSI

Zoom

https://uib.zoom.us/j/65344277321?pwd=THVPY3hZQmlRa0loOWd6b2xKaFRrZz09

Docs

separate note: working towards user-facing docs: https://hackmd.io/irkuPm4BSye6OL24wKmpgw

Goals

building with GCC included in EESSI -- For notes, see section below
Have PythonPackage to work on top of EESSI (with EasyBuild)
- have a simple EB recipe working on EESSI out of the box (configuring EB correctly)
  - start prefix and then configure correctly EB
    - /cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/$(uname -m)/startprefix
  - https://github.com/EESSI/software-layer/blob/main/configure_easybuild
- some documentation on it
Hugo: building WRF on top of EESSI (with EasyBuild)
standalone script to install software on top of EESSI

startprefix <<<
source configure_easybuild
eb example.eb

Building with GCC included in EESSI

module load GCC should set up RPATH wrappers, such that gcc and other tools do the right thing with respect to RPATH
- Should be transparent to users for simple builds
- Users should still be aware of the issue => Documentation needed
Development workflow:
- module load GCC with unchanged module
- Write python script that generates rpath wrappers from easybuild framework functions
- Put resulting script in PATH to replace gcc and other commands and forward to correct commands within script
- If everything works, figure out how to create & ship wrappers on install step of GCC
Look into prepare_rpath_wrapper from easybuild-framework
- https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/tools/toolchain/toolchain.py#L947
- https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/scripts/rpath_args.py
What to do with filters and include paths?
- Make use of environment variables
- Environment variables could be set during easybuild installation step
- Env-Variables should be discussed in documentation if users want to exclude/inject certain libraries in their build process
Just for reference, ComputeCanada provide a script to patch binaries: https://github.com/ComputeCanada/easybuild-computecanada-config/blob/main/setrpaths.sh
- And they ship an ld-wrapper with their compatibility layer: https://github.com/ComputeCanada/gentoo-overlay/blob/8fdb45ba676a5fbb19f165bd85a9c82470218753/sys-devel/binutils-config/files/ld-wrapper.sh
First script available
- Assumes source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash and module load EasyBuild/4.4.1
- Produces wrapper scripts in /tmp/eb-.../tmp...../rpath_wrappers/{gcc,gxx,gfortran,ld.bfd,ld.gold,ld}_wrapper
- Output looks like this: export PATH=/tmp/eb-d3kwvfit/tmpg3zkeuoc/rpath_wrappers/ld.bfd_wrapper:<other wrapper paths + original $PATH>
- You can set the environment variables $RPATH_FILTER_DIRS and $RPATH_INCLUDE_DIRS to exclude/include certain paths as RPATHs. Right now the variables are expected to be comma-separated lists. Should we change it to ':' or configurable separator?
- Compiling MCFM as a test project. readelf -d lists RPATHs and program seems to work.
- Compiling a hello world example with and without this modified PATH gives:

$ readelf -d hello_world_* | grep rpath
File: hello_world_norpath
File: hello_world_rpath
 0x000000000000000f (RPATH)              Library rpath: [/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/haswell/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0:/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/haswell/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib/../lib64:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/lib/../lib64:/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/haswell/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../..:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/lib:/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/x86_64/usr/lib]

Integrate rpath wrapper creating function into corresponding EasyBuild blocks
- Creates wrappers
- moves them to bin/rpath_wrappers in install location
- sets self.wrapperdir
- Function should be made more generic. Reusable for other compiler easyblocks as well, not just gcc.py
- Not an opt-in option yet, but used whenever build_option('rpath') is set
Setting PATH while loading GCCcore can just be done by adding guesses['PATH'].insert(0, self.wrapperdir) to gcc.py#make_module_req_guess()
Left to do (bold = WIP):
1. Refactor create_rpath_wrappers to a generic loaction
2. Integrate rpath wrappers as opt-in option install_rpath_wrappers = True in EasyBuild
3. Include environment variable DISABLE_RPATH_WRAPPERS in module file
4. How to handle ld wrappers when loading GCC module? How to do this in EasyBuild vs. how to do this in EESSI with binutils from compatibility layer?
5. Make sure rpath wrappers don't mess up loading GCC modules within EasyBuild
6. Ship similar feature for other popular compilers?
7. write documentation in EasyBuild RPATH, Compiler docu. (Also in module file itself?)
Continuing discussion here: EasyBuild PR 2638

Starting point for script to generate EasyBuild RPATH wrapper scripts

#!/usr/bin/env python
from easybuild.tools.options import set_up_configuration
set_up_configuration(silent=True)

from easybuild.tools.toolchain.toolchain import Toolchain

tc = Toolchain()
tc.prepare_rpath_wrappers([], [])

Building WRF

The following procedure will build correct rpath binaries for WRF

/cvmfs/pilot.eessi-hpc.org/2021.06/compat/linux/$(uname -m)/startprefix
source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash
ml load EasyBuild/4.4.1
export EASYBUILD_PREFIX=/project/def-sponsor00/easybuild
export EASYBUILD_IGNORE_OSDEPS=1
export EASYBUILD_SYSROOT=${EPREFIX}
export EASYBUILD_RPATH=1
export EASYBUILD_FILTER_ENV_VARS=LD_LIBRARY_PATH
export EASYBUILD_FILTER_DEPS=Autoconf,Automake,Autotools,binutils,bzip2,cURL,DBus,flex,gettext,gperf,help2man,intltool,libreadline,libtool,Lua,M4,makeinfo,ncurses,util-linux,XZ,zlib
export EASYBUILD_MODULE_EXTENSIONS=1
eb -S WRF
-> get correct CFGS1=/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/intel/<arch>/software/EasyBuild/4.4.1/easybuild/easyconfigs
eb -r $CFGS1/w/WRF/WRF-3.9.1.1-foss-2020a-dmpar.eb

Standalone EESSI installation script

Goals:

configure easybuild properly
feed a script with easybuild recipe to install it on top of EESSI

Actual workflow

~~1. start EESSI environment~~ ~~source /cvmfs/pilot.eessi-hpc.org/2021.06/init/bash~~ ~~2. load EasyBuild~~ ~~ml EasyBuild/4.4.1~~ ~~3. install your recipe~~ ~~eb YOURAPP.eb~~

Needs:

check the different archs to build the app upon
read the recipe(s) and the patches related → git diff with the actual app list?
build the app(s) on the different archs, ideally in parallel

Example

# ARCHLIST = list of architectures
# REPO = where your recipes and patches are stored 
for arch in $ARCHLIST
do
   install_on_top.sh $REPO $ARCH
done

EESSI hackathon Dec'21 - workflow to propose additions to EESSI software stack

Zoom

https://uib.zoom.us/j/69823235860?pwd=UjRNYmV0UGoxSmdGMkZsclpBSGJZQT09

Bot account on CitC

[email protected]

Notes from brainstorm meeting

https://github.com/EESSI/meetings/wiki/Brainstorm-meeting-github-app-software-layer-Nov-26th-2021

Code repo for App code

https://github.com/EESSI/eessi-bot-software-layer

First steps

Go through brainstorm meeting notes [Everyone]
Set up the app on VM [Bob]
Make a very simple easystack example [Jörg]
Collect some event data [Bob]
- Can easily be done using our Smee URL: https://smee.io/7PIXBDoqczjEVXaf
Meet at 4pm CET

Dec 14 meeting

Tasks

Next functionality/steps to be implemented

On the (build/login) node where the app is running:
- Create unique working directory for the job (use event id?)
- Checkout the branch with the easystack file
- Submit the job
- Take and upload the log(s) in case of failures
(submit job to) do some eb run
- Install the apps from the easystack
- Run tests (eb --sanity-check-only)
- Make a tarball
- report back in PR
(submit job to) do test run
- Different OS
- Unpack tarball
- Re-run tests
- report back in PR

sbatch -C shape=c4.2xlarge => haswell

Running the app

# Clone the repo
git clone https://github.com/EESSI/eessi-bot-software-layer

# Run smee (in screen)
cd eessi-bot-software-layer
./smee.sh

# Run the app itself with a Python virtual environment
cd eessi-bot-software-layer
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
./run.sh

Authenticating the app

https://github.com/PyGithub/PyGithub/issues/1766#issuecomment-749519409

# Fill the required information
APP_ID = 
PRIVATE_TOKEN =  
INSTALLATION_ID = 

github_integration = github.GithubIntegration(APP_ID, PRIVATE_TOKEN)
# Note that installation access tokens last only for 1 hour, you will need to regenerate them after they expire.
access_token = github_integration.get_access_token(INSTALLATION_ID)

login = github.Github(access_token)

APP_ID can be found at: https://github.com/organizations/EESSI/settings/apps/eessi-bot-software-layer
PRIVATE_TOKEN is a private key that can be generated on the same page
INSTALLATION_ID can be found by going to this page, selecting the configuration button for the installed app, and copy it from the URL: https://github.com/organizations/EESSI/settings/apps/eessi-bot-software-layer/installations
- or: use github_integration.get_installation('EESSI', 'software-layer') (or some other repo to which the app is subscribed)

EESSI hackathon Dec'21 - GPU support

Zoom

https://uib.zoom.us/j/69890745932?pwd=bWlxV2prTyswS0Q4SWptMzA3bDVBQT09

Notes

CUDA cannot (currently) be distributed by EESSI!

Install compatiblity layers from CUDA to deal with CUDA/driver (mis)matching

Use host-injection for EESSI to get CUDA available to EESSI. Use symlink in EESSI that needs to point to correct CUDA path on local site. Need check to check it's actually working.

So EESSI needs to set up cuda on the local site.

Approach:

EESSI software expect CUDA to be at certain location (a broken symlink by default)
Host side should 'provide' the symlink location (host injection)
nvidia drivers need to be available on the host system (outside EESSI)
EESSI will set up CUDA on the host side (need writable path)
Lmod visible hook: only show CUDA dependant modules when CUDA works (maybe check variable EESSI_GPU_SUPPORT_ACTIVE=1?)
Use EasyBuild to install CUDA module on the host side.

Planning:

Alan will try to reproduce what he already did before but didn't document ;)
Ward has a solid block of available time on Thursday
Michael helps out wherever he can

Related issues on GitHub:

Progress

Prepare development environment

Working environment is on eessi-gpu.learnhpc.eu.

A shared space for installations at /project/def-sponsor00/easybuild is to be created. To behave similarly to the EESSI installation script, we need to drop into a Gentoo prefix shell with :

$EPREFIX/startprefix

The full EasyBuild environment used was

source /etc/profile.d/z-01-site.sh
export EASYBUILD_PREFIX=/project/def-sponsor00/easybuild
export EASYBUILD_IGNORE_OSDEPS=1
export EASYBUILD_SYSROOT=${EPREFIX}
export EASYBUILD_RPATH=1
export EASYBUILD_FILTER_ENV_VARS=LD_LIBRARY_PATH
export EASYBUILD_FILTER_DEPS=Autoconf,Automake,Autotools,binutils,bzip2,cURL,DBus,flex,gettext,gperf,help2man,intltool,libreadline,libtool,Lua,M4,makeinfo,ncurses,util-linux,XZ,zlib
export EASYBUILD_MODULE_EXTENSIONS=1
module load EasyBuild

At this point we can now install software with EasyBuild

Install CUDA

Nothing special here, standard installation with EasyBuild:

eb CUDAcore-11.3.1.eb

Once installed, we need to make the module available:

module use /project/def-sponsor00/easybuild/modules/all/

Version of CUDA not supported by driver

We just installed CUDA 11.3 but we check the CUDA version supported by our driver:

[ocaisa@gpunode1 ocaisa]$ nvidia-smi 
Mon Dec 13 13:57:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:05.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    304MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Installing the CUDA compatability drivers

We do not (necessarily) need to update our drivers to use the latest CUDA. NVIDIA have long term support drivers (R450 and R470 for now) with which you can use CUDA compatability libraries to use the latest CUDA. Making these libaries findable by the compat layer is enough to give you a working CUDA.

For details see https://docs.nvidia.com/datacenter/tesla/drivers/ (and specifically https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers) where they say A production branch that will be supported and maintained for a much longer time than a normal production branch is supported. Every LTSB is a production branch, but not every production branch is an LTSB.. We can parse https://docs.nvidia.com/datacenter/tesla/drivers/releases.json to figure out the LTS branches (and whether someone should upgrade).

At any point in time, it is best to install the latest version of the compat libraries (since these will track the driver versions). To find the right compat libraries to install we need to be able to navigate https://developer.download.nvidia.com/compute/cuda/repos/, selecting the right OS and the latest version of the compat libraries.

Once we know this, we install the CUDA compatability libraries so that 11.3 will work with our driver version. Let's put the drivers in a place that is automatically found by the EESSI linker (/cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64/lib) and set things up so we can universally upgrade to a later version of the compat libaries. /cvmfs/pilot.eessi-hpc.org/host_injections points to /opt/eessi by default and I have made that group writable on our cluster:

# Create a general space for our NVIDIA compat drivers
mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia
cd /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia

# Grab the latest compat library RPM
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-compat-11-5-495.29.05-1.x86_64.rpm

# Unpack it
rpm2cpio cuda-compat-11-5-495.29.05-1.x86_64.rpm | cpio -idmv
mv usr/local/cuda-11.5 .
rm -r usr

# Add a symlink that points to the latest version
ln -s cuda-11.5 latest

# Create the space to host the libraries
mkdir -p /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64
# Symlink in the path to the latest libraries
ln -s /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat /cvmfs/pilot.eessi-hpc.org/host_injections/2021.06/compat/linux/x86_64/lib

Now we can again check the supported CUDA version:

[ocaisa@gpunode1 ~]$ nvidia-smi 
Mon Dec 13 14:06:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:05.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    304MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note that Compute Canada are considering putting the compatability libraries directly into their Gentoo Prefix compatability layer (see https://github.com/ComputeCanada/software-stack/issues/79). I'm not sure whether this can really be OS independent.

Build samples

Make a local copy of the CUDA examples:

module load CUDAcore
cp -r $EBROOTCUDACORE/samples ~/

Build the CUDA samples with GCC from EESSI as the host compiler:

module load GCC CUDAcore
cd ~/samples
make HOST_COMPILER=$(which g++)

Unfortunately this seems to fail for some samples:

make[1]: Leaving directory '/home/ocaisa/samples/7_CUDALibraries/conjugateGradientUM'
/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64/libstdc++.so: error: undefined reference to 'fstat64', version 'GLIBC_2.33'
/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64/libstdc++.so: error: undefined reference to 'stat', version 'GLIBC_2.33'
/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen2/software/GCCcore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../lib64/libstdc++.so: error: undefined reference to 'lstat', version 'GLIBC_2.33'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:363: simpleCUFFT_callback] Error 1
make[1]: Leaving directory '/home/ocaisa/samples/7_CUDALibraries/simpleCUFFT_callback'
make: *** [Makefile:51: 7_CUDALibraries/simpleCUFFT_callback/Makefile.ph_build] Error 2

This looks related to the compatability layer.

Test

Once built, we can test some of the resulting exectables:

[ocaisa@gpunode1 samples]$ ./bin/x86_64/linux/release/deviceQuery
./bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID V100-4C"
  CUDA Driver Version / Runtime Version          11.3 / 11.3
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (080) Multiprocessors, (064) CUDA Cores/MP:    5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                No
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 5
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS

Issues

Should we be tweaking the ELF headers of CUDA since it is a binary install?

Compute Canada do this with their setrpaths.sh script.

This needed a tiny modification for use with EESSI (the linker path inside was incorrect). It throws errors though, the source of which we are not really sure about (but suspect it is related to permissions on files):

ldd: warning: you do not have execution permission for `/project/def-sponsor00/easybuild/software/CUDAcore/11.3.1/nsight-systems-2021.1.3/target-linux-x64/libcupti.so.11.1'
patchelf: open: Permission denied
ldd: warning: you do not have execution permission for `/project/def-sponsor00/easybuild/software/CUDAcore/11.3.1/nsight-systems-2021.1.3/target-linux-x64/libcupti.so.11.3'
patchelf: open: Permission denied
ldd: warning: you do not have execution permission for `/project/def-sponsor00/easybuild/software/CUDAcore/11.3.1/nsight-systems-2021.1.3/target-linux-x64/libcupti.so.11.2'
patchelf: open: Permission denied
patchelf: open: Permission denied
patchelf: open: Permission denied
patchelf: open: Permission denied

Need to update the compat libraries on each release

If the user (or more specifically an admin) updates the drivers on the system then the compat libraries will cease to work with errors like:

[ocaisa@gnode1 release]$./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 803
-> system has unsupported display driver / cuda driver combination
Result = FAIL

We can't really control this. The best we could do is check nvidia-smi and make sure we are still using a compatible compat library

driver_cuda=$(nvidia-smi  -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
eessi_cuda=$(LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat/:$LD_LIBRARY_PATH nvidia-smi  -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
if [ "$driver_cuda" -gt "$eessi_cuda" ]; then  echo "You need to update your CUDA compatability libraries"; fi

This could be done on shell initialisation.

Final status

See https://github.com/EESSI/hackathons/tree/05_gpu/2021-12/05_gpu for the scripts that more or less capture the content discussed here.

What about visualisation?

This is a good solution for CUDA, but doesn't cover the GL and EGL libraries that would be needed for visualisation. Having said that if we adopt the GL approach taken at JSC, we should be able to figure out how to correctly set the paths to find the system NVIDIA GL/EGL libraries without needing to do any other magic.

EESSI hackathon Dec'21 - EESSI test suite

Zoom

https://uib.zoom.us/j/63178835002?pwd=SnUzTmFpcmlhS0VueWRwM2RicGtBdz09

Participants

Thomas Röblitz
Hugo Meiland
Caspar van Leeuwen
Vasileous Karakasis (support for ReFrame questions)
(Bob, input on required compat layer tests)

Kickoff Meeting

14:00-15:00 CEST: intro to ReFrame (by Vasileous) Directly after (~15:00-16:00 CEST): planning & dividing tasks

module use -a /project/def-sponsor00/easybuild/modules/all
module load ReFrame/3.9.2
PYTHONPATH=$PYTHONPATH:~/reframe/:~/software-layer/tests/reframe/ reframe -C config/settings.py -c eessi-checks/applications/ -l -t CI -t singlenode

Compat tests

Current tests @ https://github.com/EESSI/compatibility-layer/blob/main/test/compat_layer.py We emerge e.g. https://github.com/EESSI/gentoo-overlay/blob/main/etc/portage/sets/eessi-2021.12-linux-x86_64

Some issues

Some architectures don't support all packages, e.g. OPA-PSM is not support by arm => use skip_if to skip those selected architectures?
New features (e.g. host-injections for libraries, GPU support, building on top of EESSI)

List of tests we need

Compiler tests (Thomas, see e.g. on how to catch https://github.com/EESSI/software-layer/issues/26)
Python (Thomas)
RDMA core (Thomas)
OPA-PSM (Thomas)
test host-injections library (Thomas)

Software-layer tests

Probably it makes sense to not test all individual libraries (cairo, boost, etc) but (at least start with) end-user applications. Maybe a few very key low level libs (that are part of toolchains for example) could be tested though.

WRF (Hugo Meiland)
OpenFOAM
GROMACS (Caspar van Leeuwen)
ParaView
QuantumESPRESSO
Python
R
R-bundle-Bioconductor
UCX
OpenMPI
OSU-Micro-Benchmarks
OpenBLAS?
FFTW?
ScaLAPACK?

Relevant links

Brainstorm on software deployment procedure & testing: https://github.com/EESSI/meetings/wiki/Brainstorm-meeting-software-deployment-Nov-24th-2021
ReFrame library test GROMACS https://github.com/eth-cscs/reframe/blob/v3.9.2/hpctestlib/sciapps/gromacs/benchmarks.py
CSCS implementation GROMACS https://github.com/eth-cscs/reframe/blob/master/cscs-checks/apps/gromacs/gromacs_check.py
Potential EESSI implementation GROMACS https://github.com/casparvl/software-layer/blob/gromacs_cscs/tests/reframe/eessi-checks/applications/gromacs_check.py
Getting access to AWS test cluster https://github.com/EESSI/hackathons/tree/main/2021-12/magic_castle

Daily log

Day 1 (Monday 13th):

kickoff, intro reframe, discussion of tasks/goals
Thomas:
- get access to Magic Castle and CitC resources
- revisit some simple ReFrame tutorials on eessi.learnhpc.eu

Day 2

looking at https://github.com/EESSI/compatibility-layer/blob/main/test/compat_layer.py
- Test script usually does not assume to run from within an EESSI pilot environment. The env vars it accesses at the beginning (EESSI_VERSION, EESSI_OS & EESSI_ARCH) need to be set before the script is run. They must not be confused with the env vars being set for a pilot environment (e.g., EESSI_PILOT_VERSION, EESSI_OS_TYPE & EESSI_CPU_FAMILY). Of course, running in a pilot environment one could reuse these to set the env vars used in the script.
  - export EESSI_VERSION=${EESSI_PILOT_VERSION}
  - export EESSI_OS=${EESSI_OS_TYPE}
  - export EESSI_ARCH=${EESSI_CPU_FAMILY}
- The GitHub Action which uses this test script is available at https://github.com/EESSI/compatibility-layer/blob/main/.github/workflows/pilot_repo.yml#L74
- Playing with setting the above variables to 'odd' values, e.g., macos (on a linux machine), aarch64 (on a x86_64 machine), results in various errors.
  - idea: check for reasonable values of these vars first, only execute other tests if variables have meaningful values
    - revisiting reframe tutorials 3 (https://reframe-hpc.readthedocs.io/en/stable/tutorial_deps.html) & 4 (https://reframe-hpc.readthedocs.io/en/stable/tutorial_fixtures.html)
TODO check ReFrame repo for python/gcc tests in cscs-checks
- see https://github.com/eth-cscs/reframe/tree/v3.9.2/cscs-checks/prgenv
- and https://github.com/eth-cscs/reframe/tree/v3.9.2/cscs-checks/compile
TODO check EESSI issue https://github.com/EESSI/software-layer/issues/26

Getting EESSI GROMACS test (CSCS GROMACS libtest) working on test cluster

The test from https://github.com/casparvl/software-layer/blob/gromacs_cscs/tests/reframe/eessi-checks/applications/gromacs_check.py does not run out of the box.

Default launcher in my settings.py file was srun. That doesn't work, since there is no SLURM integration with pmi2 between the EESSI stack and the host SLURM. Changed config file to use mpirun
By default, jobs on the test cluster only get 9 GB of memory. That seems to not be enough
- Most portable way to fix this seems to be to define an extra_resources in the test https://reframe-hpc.readthedocs.io/en/stable/tutorial_advanced.html?highlight=memory#adding-job-scheduler-options-per-test
- It is not very portable though: you need agreement between the test and the settings.py on the name of the extra resource name (in this case memory seems the most sensible...)
- Should this be defined differently (e.g. with a fixed keyword) in the settings file? There are two aspects to it: how to get more resources (e.g. passing the --mem flag to SLURM) and describing how much resources a partition has (how much memory do the nodes have?)
Job now seems to run, though I still get [node1.int.eessi.learnhpc.eu:23967] pml_ucx.c:273 Error: Failed to create UCP worker
Test still fails, Reason: permission error: [Errno 13] Permission denied: '/home/casparl/.../md.log'. No clue why, the file has the correct file permissions... -rw-rw-r-- 1 casparl casparl 27410 Dec 14 10:27 /home/casparl/.../md.log

In this test, especially with requesting extra memory, I realize I still struggle with some limitations in ReFrame regarding requesting extra resources / checking if sufficient resources are present. I have some ideas on improving portability when it comes to extra resources. (memory, gpus, etc). I think it would be good if ReFrame standardized some resources.

Essentially, there are two components:

I might need to add extra flags to my scheduler to get extra resources (GPU, memory).
I might want to programmatically check from a test if certain resources are available.

Right now, for the first, I could define an extra_resource, but the part I don't like about that from a portability point of view is that the names of extra resources are free text fields. I.e. in the example in the docs, it's called 'name': 'memory' . That means you create a tight relation between the test and the associated config file: both need to agree that this resource is called memory (and not e.g. mem or something else). That means that if I write a (portable) test suite that uses memory as extra resource, I have to instruct all the user of that they have to define a resource in their config file with that exact name.

For the 2nd point, I'd like to have something similar to the processor object, which describes what is present in that particular partition in terms of hardware. E.g. simply a memory_per_node item that describes the maximum amount of memory that is present per node. For GPUs, I now use devices, but it has the same issue: device names are free text, and thus it creates a tight relation between the devices named in the test and in the config file. I circumvent this by isolating this in the utils and hooks.

Building WRF Conus test

Steps to be taken in the reframe test

Download Conus benchmark dataset from http://www2.mmm.ucar.edu/wrf/bench/conus12km_v3911/bench_12km.tar.bz2
- create mirrors? above link is not too fast....
- or even host some benchmark datasets in cvmfs?
mkdir wrf-workdir && cd wrf-workdir
ln -s dirname $(which wrf.exe)/../run/* .
rm namelist.input
ln -s bench_12km/* .
ml load WRF
mpirun wrf.exe
on single node, 16 cores -> 2m44s

Day 3

On the magic castle hackathon login node: When I run reframe --list-tags I get the message

/project/def-sponsor00/easybuild/software/ReFrame/3.9.2/bin/reframe: check path '/project/60005/easybuild/software/ReFrame/3.9.2/lib/python3.9/site-packages/checks' does not exist

Where does this come from? Can/should I change this?
- You can set the ReFrame search path where it searches for tests using the -c argument. Just point it to the dir where you are developing tests.

EESSI GROMACS

Use extra_resources to get GROMACS test to ask for extra memory? Or should we just instruct people to use --mem in the access item of the ReFrame config to ask for the maximum amount of memory?
- The first is pretty tricky: we'd need to check for every use case how much memory is needed (and it potentially varies with node count).
- Probably go for the option of adding --mem=<max_available> to the access config item for now

Day 5

Demo by Caspar

export CHECKOUT_PREFIX=~/eessi-testsuite
mkdir -p $CHECKOUT_PREFIX

# Checkout relevant git repo's
cd $CHECKOUT_PREFIX
git clone https://github.com/casparvl/software-layer.git
git clone https://github.com/eth-cscs/reframe.git
cd reframe
git fetch --all --tags
git checkout tags/v3.9.2
cd ..
cd software-layer/tests/reframe
git checkout gromacs_cscs

# Note: PYTHONPATH needs to be set to find the hpctestlib that comes with ReFrame, as well as eessi-utils/hooks.py and eessi-utils/utils.py
export PYTHONPATH=$PYTHONPATH:$CHECKOUT_PREFIX/reframe/:$CHECKOUT_PREFIX/software-layer/tests/reframe/

# Demonstrating selectiong of tests:
# List tests to be run in CI on build node (only smallest GROMACS test case, single node):
reframe -C config/settings_magic_castle.py -c eessi-checks/applications/ -l -t CI -t singlenode

# List tests to be run in monitoring
reframe -C config/settings_magic_castle.py -c eessi-checks/applications/ -l -t monitoring

# Run actual tests
reframe -C config/settings_magic_castle.py -c eessi-checks/applications/ -r -t CI -t singlenode --performance-report

EESSI hackathon Dec'21 - exporting EESSI to a tarball and/or container image

Milestone 1: get familiar with the env
- git repo
- aws cluster
- enough disk space
Milestone 2: explore some ideas
- make local copy of 2021.06 ... observations:
  - cvmsfs-to-citc copy is kinda slow (26h); there's potential cache polution of this action
  - plenty of files in compat layer only readable to root and cvmfs; need to understand if this makes root privs necessary to perform this action
  - CVMFS_HIDE_MAGIC_XATTRS issue with cvmfs 2.9.0 ... I rolled back to 2.8.2
- examine directory structure and variant symlinks
  - idea: this should allow user to set single env var to point variant symlinks to the tree of choice, even if this tree is outside of /cvmfs hirearchy
  - need issue 32 finalized and in place to try this
- see if we can get binaries in local copy to work as expected
  - see how to run compat layer test suite on local copy
Milestone 3: script this "make local copy" process
Milestone 4: explore archival/restore of local copy
- figure out if anything more than just tar/untar is needed
Milestone 5: script this archival/restore
Milestone 6: explore a creation of container with local copy of eessi
- figure out what to base it on - minimal el8?
- figure out to what degree this makes sense to be scripted
Milestone 7: if archiving whole eessi version turns out to be unfeasible, explore making a local copy of just a specific piece of module and its dependencies
- module load + env | grep EBROOT and copy only those + compat layer
- observations:
  - tar of local copy of eessi is slow too - maybe the underlying aws fs is also not too happy with small files
  - copy+tar of x86_64 compat+foss is 131min
  - almost 2GB of stuff in compat layer /var can be ignored, which brings us down to 40min for foss and under 1h for bioconductor
  - resulting tarball sizes are 1.5GB for foss, 1.8GB for Gromacs and 8.2GB for bioconductor
Milestone 8: wrap up that in a container
- we can possibly adopt some of Jorg's scripts
- I picked latest centos8 as a base
- env vars handling is a big todo
- naming of resulting image also needs to be done better
- tar/untar is there because this script was developed on two systems, can be dropped if everything is available on the same system

Question from my boss: Can we assign something like DOI to these containers?

EESSI hackathon Dec'21 - monitoring

Zoom

https://uib.zoom.us/j/61135526605?pwd=VkZnRXhMVTI1RkIxTis2Vm4yUkRtQT09

The idea

Every Stratum1 gets its own installation of prometheus and grafana.
If the S1 is public, open ports so monitoring.eessi-infra.org can fetch the prometheus data

Ansible roles for stratum1s

https://github.com/cloudalchemy/ansible-prometheus https://github.com/cloudalchemy/ansible-node-exporter https://github.com/cloudalchemy/ansible-grafana

Add

https://gitlab.cern.ch/cloud/cvmfs-prometheus-exporter

With the accompanying grafana dashboard.

Tasks

Create ansible playbook that installs prometheus, node exporter, and grafana, ensure that they listen to localhost only (see URLs above)
Extend same ansible playbook to include CVMFS prometheus exporter
Install default grafana dashboard (copy the json file to /var/lib/dashboards on the server). See https://github.com/cloudalchemy/ansible-grafana/blob/master/defaults/main.yml
Write something smart about alerts from each S1
Open ports for monitoring.eessi-infra.org (we can fix the local firewall, https://docs.ansible.com/ansible/latest/collections/ansible/posix/firewalld_module.html, but what about the rules for ACLs for the node itself?
Install grafana on monitoring.eessi-infra.org and make some pretty dashboards and alerts. For multiple data sources in a single dashboard, see https://stackoverflow.com/questions/63349357/how-to-configure-a-grafana-dashboard-for-multiple-prometheus-datasources

Open questions

Do we need https://prometheus.io/docs/prometheus/latest/federation/ on our monitoring.eessi-infra.org?
Auth and TLS for the services?

Further work

Decouple the ansible roles for stratum0s, stratum1s, clients, and proxies from filesystem-layer into galaxy repos (hosted on github or wherever). For multi-role support in the same repo, see https://github.com/ansible/ansible/issues/16804

EESSI hackathon Dec'21 - setting up a private Stratum-1

Zoom

https://uib.zoom.us/j/62100150823?pwd=OFNpWk9RZ3llTWdqZ0VId1VUMG03UT09

Notes

Purposes

Faster access to EESSI software stack / alleviate access to further-away servers
Offline access through private network

Resources Existing Stratum 1 Ansible script: https://github.com/EESSI/filesystem-layer/blob/main/stratum1.yml

Link to CVMFS workshop: https://cvmfs-contrib.github.io/cvmfs-tutorial-2021/

Tasks

Take a run through the current instructions to setup S1, and document anything site-specific that needs to be set
Check how to define a specific directory to download the S0 snapshot to
Test if a client that is not within the private S1's allowed IP range cannot access the private S1 (result: S1 is accessible by any client)
Test offline usage of private S1

Set up a private S1 from scratch

Install Ansible & required Ansible roles [1]
Apply for a GeoIP license [2]
Set IP ranges for clients accessing this S1 in local_site_specific_vars.yml
Set IP of S1 as hosts in local_site_specific_vars.yml
Execute stratum1.yml playbook with ansible-playbook -b -e @inventory/local_site_specific_vars.yml stratum1.yml (takes several hours -> downloads ~65GB at time of writing)

[1] https://github.com/EESSI/filesystem-layer/blob/main/requirements.yml [2] https://www.maxmind.com/en/geolite2/signup/

Set up the clients to use your private S1

Commands to be run after following the instructions here: https://github.com/EESSI/filesystem-layer#clients

echo 'CVMFS_SERVER_URL="http://<S1_IP>/cvmfs/@fqrn@;$CVMFS_SERVER_URL"' | sudo tee -a /etc/cvmfs/domain.d/eessi-hpc.org.local
export CVMFS_SERVER_URL="http://<S1_IP>/cvmfs/@fqrn@;$CVMFS_SERVER_URL"
sudo cvmfs_config reload -c pilot.eessi-hpc.org

Remarks during setup

CVMFS allows to set the directory to which a repository is mounted to (by default: /srv/cvmfs). This should be exposed in the EESSI config too. Maybe it is already?
I wasn't able to run the Ansible script as localhost on the S1 directly. I get an SSH error (shouldn't SSH if localhost)

Usefull commands during setup

On S1

sudo cvmfs_server check pilot.eessi-hpc.org: verifies the repository content on the S1

On client

curl --head http://<S1IP>/cvmfs/pilot.eessi-hpc.org/.cvmfspublished: check connection to S1 with IP S1_IP
cvmfs_config stat -v pilot.eessi-hpc.org: check which S1 a client uses to connect to
cvmfs_config showconfig pilot.eessi-hpc.org: show the configuration used by a client (to make sure a local config file is picked up correctly)
sudo cvmfs_config killall: kill all CVMFS processes (requires an ls <path_to_repo> to remount the repository under /cvmfs)
sudo cvmfs_config reload -c pilot.eessi-hpc.org: force-reload the client configuration
curl "http://S1_IP/cvmfs/pilot.eessi-hpc.org/api/v1.0/geo/CLIENT_IP/S1_IP,aws-eu-west1.stratum1.cvmfs.eessi-infra.org,azure-us-east1.stratum1.cvmfs.eessi-infra.org,bgo-no.stratum1.cvmfs.eessi-infra.org,rug-nl.stratum1.cvmfs.eessi-infra.org": returns a list of indices (e.g: 1, 5, 4, 3, 2), which ranks the servers (after CLIENT_IP/) on closest to farthest

Issues & Resolutions

If you get "Reload FAILED! CernVM-FS mountpoints unusable. pilot.eessi-hpc.org: Failed to initialize root file catalog (16)", it probably means your CVMFS_SERVER_URL is not set properly. It should end up in the form "http://<S1_IP>/cvmfs/pilot.eessi-hpc.org" if you run cvmfs_config stat -v pilot.eessi-hpc.org

Open questions

Can we download the snapshot on an external (persistent) disk, and can this be attached to another S1, when the current S1 failed -> avoid redownloading the stack