meeting 2023 05 04
- date & time: Thu 4 May 2023 - 14:00 CEST (12:00 UTC)
- (every first Thursday of the month)
- venue: (online, see mail for meeting link, or ask in Slack)
- agenda:
- Quick introduction by new people
- EESSI-related meetings in last month (incl. BioHackathon report)
- Progress update per EESSI layer (incl. build-and-deploy bot + test suite)
- EESSI pilot repository
- AWS/Azure sponsorship update
- Update on MultiXscale EuroHPC project (incl. kickoff meeting)
- Upcoming events
- Q&A
Quick introduction by new people (by Kenneth)
- Lara Peeters, joined HPC team UGent to work full time on EESSI/MultiXscale
- Xin An, joined HPC team at SURF, will work part of her time on EESSI/MultiXscale
- Bart Oldeman (Digital Research Alliance of Canada)
- Remark from Alan O'Cais and Bart: the CVMFS repo from The Alliance is part of the default CVMFS config. As soon as EESSI is also part of the default config, you can easily use both environments side by side.
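A minimal sketch of what side-by-side use could look like once both repositories are distributed with the default CVMFS configuration; the repository names and init script paths below are the ones in use at the time of writing and are assumptions about a typical client setup:

```bash
# Probe both repositories from a client that has the CVMFS client installed
cvmfs_config probe soft.computecanada.ca     # software stack of The Alliance
cvmfs_config probe pilot.eessi-hpc.org       # EESSI pilot repository

# Initialize whichever environment you want in a given shell, e.g. the EESSI pilot:
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
module avail
```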
EESSI-related meetings in last month (see slides)
- Bart also joined the compat layer meetings, which was valuable as we run into similar issues
- Joined CernVM-FS meetings, talking with the CVMFS developers about EESSI becoming part of the default CVMFS config.
- They will release something in June or July, no strict date yet. We'll need to have our new domain etc. in place then.
- There is some usage tracking enabled in CernVM-FS in develop.
- Proposed idea for "Best practices with CernVM-FS for HPC" tutorial. Set up a follow-up meeting with the CVMFS devs. Will be virtual, aimed at HPC sysadmins.
- We'll adapt our cvmfs-tutorial-2021, focusing more on how to set up the cache hierarchy, etc.
- We'll get input from big centers, e.g. NERSC, on how they do that
- Some of it is in the CVMFS documentation, but fragmented and not so extensive
- topics: diskless nodes, no internet connection on nodes, impact of CVMFS daemon on HPC workloads, syncing of CVMFS repo to a different filesystem or using alien cache on GPFS, etc.
- Discussion: one of the things we need to figure out is how to deal with sites that don't support deploying additional daemons on batch nodes, or sites that have no local storage in the nodes (for a local CVMFS cache). Synchronization, an alien cache, etc. on a local filesystem (GPFS, etc.) could be a solution to explore.
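As a rough illustration of the kind of client configuration such a tutorial could cover, a sketch of an `/etc/cvmfs/default.local` using an alien cache on a shared filesystem; the paths and proxy URL are placeholders, and the exact set of parameters a site needs will differ:

```bash
# /etc/cvmfs/default.local -- sketch for nodes without suitable local disk,
# placing the CVMFS cache on a shared filesystem (e.g. GPFS) as an "alien" cache
CVMFS_REPOSITORIES=pilot.eessi-hpc.org
CVMFS_HTTP_PROXY="http://proxy.example.org:3128"   # placeholder site proxy

# Alien cache: a plain directory managed outside of CVMFS' own quota handling
CVMFS_ALIEN_CACHE=/gpfs/apps/cvmfs-cache
CVMFS_QUOTA_LIMIT=-1       # required with an alien cache: CVMFS does no cache cleanup itself
CVMFS_SHARED_CACHE=no      # required when using an alien cache
```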
Filesystem layer (see slides)
- Thomas has an enhanced ingestion script so that information can be fed back into the pull request that was used to create the tarball
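As an illustration of the mechanism (not Thomas' actual script), a comment could be posted back to the originating pull request with the GitHub CLI; the PR number and repository name below are placeholders:

```bash
# Hypothetical: report back to the pull request that produced the staged tarball
pr_number=1234   # placeholder: in practice taken from the tarball/staging metadata
gh pr comment "${pr_number}" --repo EESSI/staging-example \
    --body "Tarball has been ingested into the CVMFS repository."
```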
Compatibility layer (see slides)
- Thomas: update bootstrap script + more recent commit of Gentoo repo.
- In the past we needed root access on the node where we build the container. We now try to avoid that, so that any node can be used for building the compat layer, and the bot can be used for building the compat layer as well
- there was a profile for ARM missing, fixed upstream in Gentoo
- Stick to one version of Python, the newest available in Gentoo
- After the bootstrap, we can remove some of the packages such as CMake, Ninja, Go, and Rust, to make the compat layer more minimal (see the sketch after this list)
- Tarballs for new compat layer (x86_64 & aarch64) have been created and staged. Not ingested yet, because we ran into trouble when building software on top.
- Kenneth: one of the problems we saw was because the bootstrap script was broken. Now we are looking into what we can do to help set up better CI for the Gentoo Prefix project
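A rough sketch of what that clean-up step could look like inside the Gentoo Prefix; the package atoms and the exact list are assumptions for illustration, not the actual script:

```bash
# Run inside the Gentoo Prefix environment (e.g. after entering it via ${EPREFIX}/startprefix):
# remove build-time-only tools that are no longer needed once bootstrapping is done
emerge --unmerge dev-util/cmake dev-util/ninja dev-lang/go dev-lang/rust

# clean up anything that was only pulled in as a dependency of the removed packages
emerge --depclean
```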
Software layer (see slides)
- Thomas: the bot should check the build result (stdout), but what to check for is specific to the software layer. Thus, we want this check to be part of the software layer, and not part of the `eessi-bot-software-layer` repository (a rough sketch of the idea follows below).
- Thomas: better reporting of what is in a tarball, to provide more info to the reviewer
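A minimal sketch of what such a software-layer-owned check could look like; the script name, log location, and error patterns are hypothetical, not the actual bot interface:

```bash
#!/bin/bash
# Hypothetical bot/check-build.sh: scan captured build output for known failure
# patterns, so the bot only needs to run this script and report its outcome.
build_log="${1:-build.log}"   # path to the captured stdout of the build job (assumption)

if grep -E -q "ERROR:|FAILED:|Build failed" "${build_log}"; then
    echo "FAILURE: suspicious patterns found in ${build_log}"
    exit 1
fi

echo "SUCCESS: no known failure patterns found in ${build_log}"
```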
- Alan: some GPU apps, you can't create fat binaries. Some you even need to optimize for a single GPU device, some for a single compute capability. One way to deal with this is to have modules with different suffixes, and use the moduleRC file to load the correct one for your architecture.
- Caspar: why is this approach different compared to how we deal with CPU archs? => Because for GPUs we can mostly do fat builds, so you don't want to duplicate all binaries. Also, only some packages need to be built for very specific GPU models, and duplicating everything per GPU would create a lot of duplication.
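A sketch of what the module-suffix approach could look like with Lmod; the application name, suffixes, and file layout are made-up examples, not an agreed-upon EESSI convention:

```bash
# Hypothetical layout: one module file per GPU target, plus a .modulerc.lua
# (generated per site/partition) that marks the matching variant as the default.
ls modules/all/SomeGPUApp/
#   1.0-CUDA-cc70.lua   1.0-CUDA-cc80.lua   .modulerc.lua

cat modules/all/SomeGPUApp/.modulerc.lua
#   -- e.g. on a system with A100 GPUs (compute capability 8.0):
#   module_version("SomeGPUApp/1.0-CUDA-cc80", "default")

module load SomeGPUApp    # resolves to the cc80 variant on this system
```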
- Kenneth: we had some `.pyc` files popping up before in installation directories. To avoid that we now have the EasyBuild configuration option `--read-only-installdir`, which will prevent additional changes
- Thomas: will the build fail if a build tries to write? Kenneth: not for `.pyc` files, those will not be created if a dir is not writeable
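For reference, a sketch of how that option can be enabled; EasyBuild options can be set on the command line or via the corresponding environment variable (the easyconfig name here is a placeholder):

```bash
# one-off, on the command line:
eb --read-only-installdir SomeSoftware-1.0.eb

# or persistently, via the corresponding environment variable:
export EASYBUILD_READ_ONLY_INSTALLDIR=1
```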
- Kenneth: Espresso easyconfig is there now.
- Kenneth: next thing for the software layer is to build on top of the new compat layer. We'll set up a regular meeting for the software layer as well, like we have for the compat layer.
Build-and-deploy bot + test suite (see slides)
- Thomas: merged two PRs in April, still have 5 open
- One is to add more logging if labeler has no permission
- Allow sending commands to the bot to tell it for which architectures to build
- Move check of the job to the target repository (see prior comments)
- If you were debugging interactively, the bot would see the job and report on it. That should not happen; this should be fixed
- Support replaying bot event locally, to not retrigger all running bots
- Implement unit tests
- Thomas: Nessi started building software on the new compat layer. Had issues with Rust 1.52.1; replacing it with 1.60.0 seems to work.
- Kenneth: would be good to understand what the problem is with Rust 1.52.1...
- Test suite would be very nice to have before ingesting
- We have a troubleshooting page for common issues
- would be interesting to run a test script (`bot/tests.sh`) right after `bot/build.sh` has run
- Q: How to figure out what tests to run? Need some mapping package(/version) -> (ReFrame) test
- can also look into running `eb --sanity-check-only` for all available modules (a rough sketch follows below)
- nice to run in CI or during build-test-deploy procedure run by bot
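A rough sketch of what running the sanity checks across everything that was just built could look like; the software directory is a placeholder, and this relies on EasyBuild keeping a copy of the easyconfig in each installation directory:

```bash
#!/bin/bash
# Re-run only EasyBuild's sanity check step for each installation found under a
# software directory; EasyBuild stores the easyconfig it used in <installdir>/easybuild/.
software_dir="/path/to/installed/software"   # placeholder, e.g. the bot's build prefix

for ec in "${software_dir}"/*/*/easybuild/*.eb; do
    [ -e "${ec}" ] || continue
    echo ">> sanity check for ${ec}"
    eb --sanity-check-only "${ec}" || echo "!! sanity check FAILED for ${ec}"
done
```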
- Terje: how do we grant access to resources used in EESSI?
- Everybody has a GitHub account, which has SSH keys
- Can we give someone in a GitHub team access to a computer?
- `github-authorized-keys` (developed by Cloudposse) provides this
- Developed a script to give e.g. an admin team access (with sudo access), or a user team (normal user access); it installs `github-authorized-keys`
- Now, Terje can log in based on his GitHub username, since it is in the right 'team'
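A sketch of the general mechanism: sshd can be pointed at an external command that returns the public keys for a user, and a wrapper around `github-authorized-keys` can serve the keys of the members of the relevant GitHub team; the wrapper path below is a placeholder, see the Cloudposse documentation for the actual setup:

```bash
# Append to /etc/ssh/sshd_config (sketch), then restart sshd
cat >> /etc/ssh/sshd_config <<'EOF'
# ask an external command for a user's SSH public keys instead of ~/.ssh/authorized_keys
AuthorizedKeysCommand /usr/local/bin/github-keys-wrapper %u
AuthorizedKeysCommandUser nobody
EOF
systemctl restart sshd
```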
EESSI pilot repository (see slides)
- Kenneth:
- Start with new stack, first rebuilding the same software we have now, and then expand with e.g. Espresso
AWS/Azure sponsorship update (see slides)
- Kenneth:
- Spent about 3k per month, half of which is on the filesystem
- Currently about 3k left, so about 1 month's worth
- Need to talk to AWS about new credits
- Need to investigate why the FS is costing a lot now
Update on MultiXscale EuroHPC project (see slides)
- Kenneth:
- We've been paired with Deucalion via Castiel-2
- It's not live yet, but we'll apply for development access
- Two training events planned:
- End-user training @ HPCKP 23 (May 18th, Barcelona)
- Best practices for CernVM-FS on HPC systems (see also comments above)
- Kenneth:
- Presentation from Caspar van Leeuwen on EESSI.
- Also covers an introduction (first 25 minutes or so) to EESSI. Could be useful to send the recording if you want to introduce new people to EESSI. Slides also available; see here for presentation material
- Kurt Lust (LUMI) also made some remarks on EESSI; see here for presentation material
- HPCKP 23, May 18th, Barcelona, registration is free (see QR code on the slides to register)
- ISC'23:
- MultiXscale has a mini-booth at ISC'23 (part of EuroHPC), with a 30-minute talk + 15-min demo on the MultiXscale project & EESSI on Tue 22 in the morning.
- There will be another talk at Azure booth, but no info yet on when.
- EESSI stickers available at the booth! (#D404)
- ...