meeting 2023 05 04
- date & time: Thu 4 May 2023 - 14:00 CEST (12:00 UTC)
- (every first Thursday of the month)
- venue: (online, see mail for meeting link, or ask in Slack)
- agenda:
- Quick introduction by new people
- EESSI-related meetings in last month (incl. BioHackathon report)
- Progress update per EESSI layer (incl. build-and-deploy bot + test suite)
- EESSI pilot repository
- AWS/Azure sponsorship update
- Update on MultiXscale EuroHPC project (incl. kickoff meeting)
- Upcoming events
- Q&A
Quick introduction by new people (by Kenneth)
- Lara Peeters, joined HPC team UGent to work full time on EESSI/MultiXscale
- Xin An, joined HPC team at SURF, will work part of her time on EESSI/MultiXscale
- Bart Oldeman (Digital Research Alliance of Canada)
- Remark from Alan O'Cais and Bart: the CVMFS repo from The Alliance is part of the default CVMFS config. As soon as EESSI is also part of the default config, you can easily use both environments side by side.
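A minimal sketch of what side-by-side use could look like once both repositories are distributed with the default CVMFS configuration; the repository names and init script paths below are the ones in use at the time of writing and are assumptions about a typical client setup:

```bash
# Probe both repositories from a client that has the CVMFS client installed
cvmfs_config probe soft.computecanada.ca     # software stack of The Alliance
cvmfs_config probe pilot.eessi-hpc.org       # EESSI pilot repository

# Initialize whichever environment you want in a given shell, e.g. the EESSI pilot:
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
module avail
```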
EESSI-related meetings in last month (see slides)
- Bart also joined the compat layer meetings, which was valuable as we run into similar issues
- Joined CernVM-FS meetings, talking with the CVMFS developers about EESSI becoming part of the default CVMFS config.
- They will release something in June or July, no strict date yet. We'll need to have our new domain etc. in place then.
- There is some usage tracking enabled in CernVM-FS in develop.
- Proposed idea for "Best practices with CernVM-FS for HPC" tutorial. Set up a follow-up meeting with the CVMFS devs. Will be virtual, aimed at HPC sysadmins.
- We'll adapt our cvmfs-tutorial-2021, focusing more on how to set up the cache hierarchy, etc.
- We'll get input from big centers, e.g. NERSC, on how they do that
- Some of it is in the CVMFS documentation, but fragmented and not so extensive
- topics: diskless nodes, no internet connection on nodes, impact of CVMFS daemon on HPC workloads, syncing of CVMFS repo to a different filesystem or using alien cache on GPFS, etc.
- Discussion: one of the things we need to figure out is how to deal with sites that don't support deploying additional daemons on batch nodes, or sites that have no local storage in the nodes (for a local CVMFS cache). Synchronization, an alien cache, etc. on a local filesystem (GPFS, etc.) could be a solution to explore.
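As a rough illustration of the kind of client configuration such a tutorial could cover, a sketch of an `/etc/cvmfs/default.local` using an alien cache on a shared filesystem; the paths and proxy URL are placeholders, and the exact set of parameters a site needs will differ:

```bash
# /etc/cvmfs/default.local -- sketch for nodes without suitable local disk,
# placing the CVMFS cache on a shared filesystem (e.g. GPFS) as an "alien" cache
CVMFS_REPOSITORIES=pilot.eessi-hpc.org
CVMFS_HTTP_PROXY="http://proxy.example.org:3128"   # placeholder site proxy

# Alien cache: a plain directory managed outside of CVMFS' own quota handling
CVMFS_ALIEN_CACHE=/gpfs/apps/cvmfs-cache
CVMFS_QUOTA_LIMIT=-1       # required with an alien cache: CVMFS does no cache cleanup itself
CVMFS_SHARED_CACHE=no      # required when using an alien cache
```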
Filesystem layer (see slides)
- Thomas has an enhanced ingestion script so that information can be fed back into the pull request that was used to create the tarball
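As an illustration of the mechanism (not Thomas' actual script), a comment could be posted back to the originating pull request with the GitHub CLI; the PR number and repository name below are placeholders:

```bash
# Hypothetical: report back to the pull request that produced the staged tarball
pr_number=1234   # placeholder: in practice taken from the tarball/staging metadata
gh pr comment "${pr_number}" --repo EESSI/staging-example \
    --body "Tarball has been ingested into the CVMFS repository."
```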
Compatibility layer (see slides)
- Thomas: update bootstrap script + more recent commit of Gentoo repo.
- In the past we needed root access on the node where we build the container. We now try to avoid that, so that any node can be used for building the compat layer, and the bot can be used for building the compat layer as well
- there was a profile for ARM missing, fixed upstream in Gentoo
- Stick to one version of Python, the newest available in Gentoo
- After the bootstrap, we can remove some of the packages such as CMake, Ninja, Go, and Rust, to make the compat layer more minimal (see the sketch after this list)
- Tarballs for new compat layer (x86_64 & aarch64) have been created and staged. Not ingested yet, because we ran into trouble when building software on top.
- Kenneth: one of the problems we saw was because the bootstrap script was broken. Now we are looking into what we can do to help set up better CI for the Gentoo Prefix project
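A rough sketch of what that clean-up step could look like inside the Gentoo Prefix; the package atoms and the exact list are assumptions for illustration, not the actual script:

```bash
# Run inside the Gentoo Prefix environment (e.g. after entering it via ${EPREFIX}/startprefix):
# remove build-time-only tools that are no longer needed once bootstrapping is done
emerge --unmerge dev-util/cmake dev-util/ninja dev-lang/go dev-lang/rust

# clean up anything that was only pulled in as a dependency of the removed packages
emerge --depclean
```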
Software layer (see slides)
- Thomas: the bot should check the build result (stdout), but what to check for is specific to the software layer. Thus, we want this check to be part of the software layer, and not part of the `eessi-bot-software-layer` repository (a rough sketch of the idea follows below).
- Thomas: better reporting of what is in a tarball, to provide more info to the reviewer
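A minimal sketch of what such a software-layer-owned check could look like; the script name, log location, and error patterns are hypothetical, not the actual bot interface:

```bash
#!/bin/bash
# Hypothetical bot/check-build.sh: scan captured build output for known failure
# patterns, so the bot only needs to run this script and report its outcome.
build_log="${1:-build.log}"   # path to the captured stdout of the build job (assumption)

if grep -E -q "ERROR:|FAILED:|Build failed" "${build_log}"; then
    echo "FAILURE: suspicious patterns found in ${build_log}"
    exit 1
fi

echo "SUCCESS: no known failure patterns found in ${build_log}"
```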
- Alan: some GPU apps, you can't create fat binaries. Some you even need to optimize for a single GPU device, some for a single compute capability. One way to deal with this is to have modules with different suffixes, and use the moduleRC file to load the correct one for your architecture.
- Caspar: why is this approach different compared to how we deal with CPU archs? => Because for GPUs we can mostly do fat builds, so you don't want to duplicate all binaries. Also, only some packages need to be built for very specific GPU models, and duplicating everything per GPU would create a lot of duplication.
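A sketch of what the module-suffix approach could look like with Lmod; the application name, suffixes, and file layout are made-up examples, not an agreed-upon EESSI convention:

```bash
# Hypothetical layout: one module file per GPU target, plus a .modulerc.lua
# (generated per site/partition) that marks the matching variant as the default.
ls modules/all/SomeGPUApp/
#   1.0-CUDA-cc70.lua   1.0-CUDA-cc80.lua   .modulerc.lua

cat modules/all/SomeGPUApp/.modulerc.lua
#   -- e.g. on a system with A100 GPUs (compute capability 8.0):
#   module_version("SomeGPUApp/1.0-CUDA-cc80", "default")

module load SomeGPUApp    # resolves to the cc80 variant on this system
```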
- Kenneth: we had some `.pyc` files popping up before in installation directories. To avoid that we now have the EasyBuild configuration option `--read-only-installdir`, which will prevent additional changes
- Thomas: will the build fail if a build tries to write? Kenneth: not for `.pyc` files, those will not be created if a dir is not writeable
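For reference, a sketch of how that option can be enabled; EasyBuild options can be set on the command line or via the corresponding environment variable (the easyconfig name here is a placeholder):

```bash
# one-off, on the command line:
eb --read-only-installdir SomeSoftware-1.0.eb

# or persistently, via the corresponding environment variable:
export EASYBUILD_READ_ONLY_INSTALLDIR=1
```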
- Kenneth: Espresso easyconfig is there now.
- Kenneth: next thing for the software layer is to build on top of the new compat layer. We'll set up a regular meeting for the software layer as well, like we have for the compat layer.
Build-and-deploy bot + test suite (see slides)
- Thomas: merged two PRs in April, still have 5 open
- One is to add more logging if labeler has no permission
- Allow sending commands to the bot to tell it for which architectures to build
- Move check of the job to the target repository (see prior comments)
- If you were debugging interactively, the bot would see the job and report on it. That should not happen; this should be fixed
- Support replaying bot event locally, to not retrigger all running bots
- Implement unit tests
- Thomas: Nessi started building software on the new compat layer. Had issues with Rust 1.52.1; replacing it with 1.60.0 seems to work.
- Kenneth: would be good to understand what the problem is with Rust 1.52.1...
- Test suite would be very nice to have before ingesting
- We have a troubleshooting page for common issues
- would be interesting to run a test script (`bot/tests.sh`) right after `bot/build.sh` has run
- Q: How to figure out what tests to run? Need some mapping package(/version) -> (ReFrame) test
- can also look into running `eb --sanity-check-only` for all available modules (a rough sketch follows below)
- nice to run in CI or during build-test-deploy procedure run by bot
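A rough sketch of what running the sanity checks across everything that was just built could look like; the software directory is a placeholder, and this relies on EasyBuild keeping a copy of the easyconfig in each installation directory:

```bash
#!/bin/bash
# Re-run only EasyBuild's sanity check step for each installation found under a
# software directory; EasyBuild stores the easyconfig it used in <installdir>/easybuild/.
software_dir="/path/to/installed/software"   # placeholder, e.g. the bot's build prefix

for ec in "${software_dir}"/*/*/easybuild/*.eb; do
    [ -e "${ec}" ] || continue
    echo ">> sanity check for ${ec}"
    eb --sanity-check-only "${ec}" || echo "!! sanity check FAILED for ${ec}"
done
```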
- Terje: how do we grant access to resources used in EESSI?
- Everybody has a GitHub account, which has SSH keys
- Can we give someone in a GitHub team access to a computer?
- `github-authorized-keys` (developed by Cloudposse) provides this
- Developed a script to give e.g. an admin team access (with sudo access), or a user team (normal user access); it installs `github-authorized-keys`
- Now, Terje can log in based on his GitHub username, since it is in the right 'team'
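A sketch of the general mechanism: sshd can be pointed at an external command that returns the public keys for a user, and a wrapper around `github-authorized-keys` can serve the keys of the members of the relevant GitHub team; the wrapper path below is a placeholder, see the Cloudposse documentation for the actual setup:

```bash
# Append to /etc/ssh/sshd_config (sketch), then restart sshd
cat >> /etc/ssh/sshd_config <<'EOF'
# ask an external command for a user's SSH public keys instead of ~/.ssh/authorized_keys
AuthorizedKeysCommand /usr/local/bin/github-keys-wrapper %u
AuthorizedKeysCommandUser nobody
EOF
systemctl restart sshd
```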
EESSI pilot repository (see slides)
- Kenneth:
- Start with new stack, first rebuilding the same software we have now, and then expand with e.g. Espresso
AWS/Azure sponsorship update (see slides)
- Kenneth:
- Spent about 3k per month, half of which is on the filesystem
- Currently about 3k left, so about 1 month's worth
- Need to talk to AWS about new credits
- Need to investigate why the FS is costing a lot now
Update on MultiXscale EuroHPC project (see slides)
- Kenneth:
- We've been paired with Deucalion via Castiel-2
- It's not live yet, but we'll apply for development access
- Two training events planned:
- End-user training @ HPCKP 23 (May 18th, Barcelona)
- Best practices for CernVM-FS on HPC systems (see also comments above)
- Kenneth:
- Presentation from Caspar van Leeuwen on EESSI.
- Also covers an introduction (first 25 minutes or so) to EESSI. Could be useful to send the recording if you want to introduce new people to EESSI. Slides also available; see here for presentation material
- Kurt Lust (LUMI) also made some remarks on EESSI; see here for presentation material
- HPCKP 23, May 18th, Barcelona, registration is free (see QR code on the slides to register)
- ISC'23:
- MultiXscale has a mini-booth at ISC'23 (part of EuroHPC), with a 30-minute talk + 15-min demo on the MultiXscale project & EESSI on Tue 22 in the morning.
- There will be another talk at Azure booth, but no info yet on when.
- EESSI stickers available at the booth! (#D404)
- ...