Use slurmd's own detection for node definition #96

Draft · wants to merge 1 commit into master

sjpb (Collaborator) commented Feb 24, 2021

Currently the node definitions are constructed from Ansible facts. In at least some situations this does not appear to be entirely satisfactory to Slurm: for example, slurmd -C shows ... Boards=1 ... and nodes are being set DOWN.

This PR runs slurmd -C on all compute nodes, then uses the values from the first host in the play in each partition (in accordance with the existing logic) to provide the node definitions.

This is a form of Trust On First Use: it assumes that the node configuration reported at that point is in fact correct.
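
As a rough illustration of the idea (not the role's actual implementation, which is done in Ansible), the sketch below parses the first line of slurmd -C output into key/value pairs and takes the parameters from the first host listed for each partition. The function names, the partition and host names, and the sample output string are all hypothetical; the sample only assumes the usual space-separated Key=Value form that slurmd -C prints.

```python
# Illustrative sample only: hypothetical host and values, in the
# space-separated Key=Value form printed by `slurmd -C`.
SAMPLE = (
    "NodeName=compute-0 CPUs=8 Boards=1 SocketsPerBoard=1 "
    "CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15885\n"
    "UpTime=0-01:02:03\n"
)


def parse_slurmd_c(output):
    """Return the Key=Value pairs from the first line of `slurmd -C` output."""
    first_line = output.strip().splitlines()[0]
    return dict(
        token.split("=", 1) for token in first_line.split() if "=" in token
    )


def node_params_per_partition(partitions, slurmd_outputs):
    """Pick hardware parameters from the first host of each partition.

    partitions:     {partition name: [hostnames, in play order]}  (hypothetical shape)
    slurmd_outputs: {hostname: raw `slurmd -C` output}
    """
    result = {}
    for name, hosts in partitions.items():
        if not hosts:
            continue  # nothing to trust for an empty partition
        params = parse_slurmd_c(slurmd_outputs[hosts[0]])
        params.pop("NodeName", None)  # the host name is per-node, not shared
        result[name] = params
    return result


params = node_params_per_partition(
    {"compute": ["compute-0", "compute-1"]},
    {"compute-0": SAMPLE, "compute-1": SAMPLE},
)
```

Only the first host's output is consulted, which is where the Trust On First Use assumption comes in: a misconfigured first host would define the expected configuration for the whole partition.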

An alternative is to specify only NodeName and omit the expected CPU parameters altogether:

Only the NodeName must be supplied in the configuration file. All other node configuration information is optional.

This would have two disadvantages:

  • Slurm cannot detect node misconfiguration
  • Scheduling is slower:

    Establishing baseline configurations will also speed Slurm's scheduling process by permitting it to compare job requirements against these (relatively few) configuration parameters and possibly avoid having to check job requirements against every individual node's configuration. The resources checked at node registration time are: CPUs, RealMemory and TmpDisk.

Quotes from https://slurm.schedmd.com/slurm.conf.html.
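
To make the two options concrete, here is a minimal sketch that renders a slurm.conf node line either with NodeName alone or with explicit hardware parameters. The helper name render_node_line, the host range and the parameter values are hypothetical and chosen only for illustration.

```python
def render_node_line(nodelist, params=None):
    """Render a slurm.conf node definition line.

    With params=None only NodeName is emitted (the alternative discussed
    above); otherwise each Key=Value pair is appended in insertion order.
    """
    parts = ["NodeName=" + nodelist]
    for key, value in (params or {}).items():
        parts.append(f"{key}={value}")
    return " ".join(parts)


# NodeName only: Slurm has no baseline to check registration against.
minimal = render_node_line("compute-[0-1]")

# Explicit baseline (hypothetical values): Slurm can compare these
# against what each node reports when it registers.
full = render_node_line(
    "compute-[0-1]",
    {"CPUs": "8", "RealMemory": "15885"},
)
```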

@sjpb sjpb marked this pull request as draft August 25, 2021 15:42