feat(slurm_ops)!: implement SackdManager for sackd service #55
…ines Currently the Slurm prometheus exporter is only activated on slurmctld machines, so it doesn't make sense to install the exporter on all nodes. If other nodes need the Slurm exporter, then we can add directly to the package list for that specific service type. Signed-off-by: Jason C. Nucciarone <[email protected]>
BREAKING CHANGES: No longer uses the PPA `ubuntu-hpc/slurm-wlm-23.02` as it only supports Jammy, and it does not provide a `sackd` package. Instead, `_AptManager` now pulls the relevant Slurm packages from either `ubuntu-hpc/experimental` or `universe`. Eventually we'll need to identify how we want to enable the latest and greatest Slurm version in deb format, but we'll boil that pot later. Signed-off-by: Jason C. Nucciarone <[email protected]>
Signed-off-by: Jason C. Nucciarone <[email protected]>
Signed-off-by: Jason C. Nucciarone <[email protected]>
@@ -908,6 +890,29 @@ def scontrol(*args) -> str:
     return _call("scontrol", *args).stdout


 class SackdManager(_SlurmManagerBase):
     """Manager for the Sackd service."""
Does `SACKD_CONFIG_SERVER` need to be documented here, as `SLURMD_CONFIG_SERVER` is for `SlurmdManager`?
It should be documented! I'm going to clean up the docstrings while I work on enabling the apt charm library to handle deb822-formatted sources.
        match self._service_name:
            case "sackd":
                packages.extend(["slurm-client"])
Also "libpmix-dev" and "openmpi-bin" to let users build MPI codes and to keep the environment consistent with the compute nodes?
Hey hey, libpmix-dev is a runtime dependency for slurmd and slurmctld. openmpi-bin is added to compute nodes so they can run MPI workloads. I don't think either of these packages belongs on the login node.
Hello! Definitely agreed MPI workloads shouldn't run on the login node. My thought process for having OpenMPI was to allow users to compile MPI binaries for use on the compute nodes since that's commonly done on login nodes (though it'd need "libopenmpi-dev" since "openmpi-bin" provides only `mpicc`, etc. compiler wrappers and not `mpi.h`).

Maybe worth revisiting in a future PR. Happy to keep this one focused on `SackdManager` and stick with "slurm-client".
For now we should have `slurm-client` be the only additional package on the login node.

When we put together some howtos for common HPC use cases such as how to submit jobs, that would be a great time to identify other packages we should be shipping on the nodes. Our software stack story is also still loosely defined, so it might not even be the `sackd` charm itself providing the packages folks will load and use.
This PR introduces the `SackdManager` class, which can be used in a downstream sackd-operator to manage the `sackd` service on machines.

BREAKING CHANGES: This PR drops the PPA `ubuntu-hpc/slurm-wlm-23.02` as it doesn't provide a `sackd` package, and the Slurm version that we can pull from Universe on Noble is newer than what is provided by the PPA. The `ubuntu-hpc/slurm-wlm-23.02` PPA also doesn't provide any packages that work on Noble.

Dropping this PPA means that you will now get an older version of Slurm on Jammy machines, but that isn't a huge issue as we're planning to push the Slurm charms to Noble anyway.
Misc.
Also, this PR makes it such that `prometheus-slurm-exporter` is only installed on `slurmctld` machines, since the slurmctld operator is the only Slurm charm that currently integrates with COS via the `cos-agent` relation. This helps minimize our install size just a little bit.
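
For readers skimming the thread, the per-service package selection discussed in this PR can be sketched roughly as follows. This is an illustration only, not the code in `slurm_ops`: the `EXTRA_PACKAGES` table and the `packages_for` helper are hypothetical names, the sketch uses a lookup table where the PR itself uses a `match` statement, and it assumes each service ships in a deb named after the service. The extras listed are the ones mentioned in the review comments and description above.

```python
# Hypothetical sketch; EXTRA_PACKAGES and packages_for are illustrative names
# and do not appear in slurm_ops.
EXTRA_PACKAGES = {
    # Login nodes only get the Slurm client tools for now.
    "sackd": ["slurm-client"],
    # The exporter is limited to slurmctld, the only charm integrated with COS.
    "slurmctld": ["libpmix-dev", "prometheus-slurm-exporter"],
    # Compute nodes additionally get openmpi-bin to run MPI workloads.
    "slurmd": ["libpmix-dev", "openmpi-bin"],
}


def packages_for(service_name: str) -> list[str]:
    """Return the packages to install for a given Slurm service."""
    # Assumption: the service's own package is named after the service.
    packages = [service_name]
    packages.extend(EXTRA_PACKAGES.get(service_name, []))
    return packages


print(packages_for("sackd"))
```

If another node type ever needs the exporter, it would be added to that service's entry only, which matches the intent stated in the first commit message.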