
Error:- cannot find cgroup plugin for cgroup/v2, slurmd initialization failed #39

Open · yashhirulkar701 opened this issue Jun 14, 2024 · 3 comments


yashhirulkar701 commented Jun 14, 2024

I am trying to create a Slurm cluster on Kubernetes (Azure Kubernetes Service), but the slurmd pod keeps crashing with the error "Couldn't find the specified plugin name for cgroup/v2 looking at all files". The errors for the slurmctld and slurmd pods are included below.

I have tried to debug this at length but with no luck. Any idea how to fix this on a Kubernetes cluster?

I can also see that slurmdbd is unable to connect to slurmctld, as shown below.

> k logs -f slurmdbd-6f59cc7887-4mwwq 

slurmdbd: debug2: _slurm_connect: failed to connect to 10.244.3.117:6817: Connection refused
slurmdbd: debug2: Error connecting slurm stream socket at 10.244.3.117:6817: Connection refused
slurmdbd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:10.244.3.117:6817: Connection refused
slurmdbd: error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster linux.
slurmdbd: error: slurm_receive_msg: No response to persist_init
slurmdbd: error: update cluster: Connection refused to linux at 10.244.3.117(6817)



> k logs -f slurmctld-0

slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state`, No such file or directory
slurmctld: error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state.old`, No such file or directory
slurmctld: No job state file (/var/spool/slurmctld/job_state.old) found
slurmctld: debug2: accounting_storage/slurmdbd: _send_cluster_tres: Sending tres '1=40,2=10,3=0,4=10,5=40,6=0,7=0,8=0' for cluster
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.5:7132]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-1 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-1/slurmd-1/slurmd-1 assigned to node slurmd-1
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-1 
slurmctld: debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.6:38712]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-0 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-0/slurmd-0/slurmd-0 assigned to node slurmd-0
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-0 
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints


> k logs -f pod/slurmd-0           
---> Set shell resource limits ...
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3547560
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
---> Copying MUNGE key ...
---> Starting the MUNGE Authentication service (munged) ...
---> Waiting for slurmctld to become active before starting slurmd...
-- slurmctld is now active ...
---> Starting the Slurm Node Daemon (slurmd) ...
slurmd: CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=886898 TmpDisk=0 Uptime=37960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:96 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:1
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
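
A quick way to confirm the mismatch from outside the pod is to compare the cgroup plugins shipped in the image with the cgroup version the node exposes (a sketch; the pod name and plugin path are the ones from this thread and may differ in other deployments):

    # list the cgroup plugins built into the image
    $ kubectl exec pod/slurmd-0 -- ls /usr/lib64/slurm/ | grep cgroup
    # prints "cgroup2fs" on a cgroup v2 host and "tmpfs" on a cgroup v1 host
    $ kubectl exec pod/slurmd-0 -- stat -fc %T /sys/fs/cgroup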

akram commented Aug 29, 2024

I went through the same issue. It seems that the provided image ghcr.io/stackhpc/slurm-docker-cluster does not include a build of the cgroup v2 plugin:

sh-4.4# ls /usr/lib64/slurm/cgroup*
/usr/lib64/slurm/cgroup_v1.a  /usr/lib64/slurm/cgroup_v1.la  /usr/lib64/slurm/cgroup_v1.so

Only cgroup_v1 is shipped in this image.
I will check how to build an image with cgroup_v2, or fall back to cgroup_v1.
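
Slurm only builds the cgroup/v2 plugin when its dependencies (notably the dbus development headers) are present at configure time, so a rebuilt image would need something along these lines (a rough sketch against a RHEL-style base, not the exact Dockerfile this project uses):

    # build dependency needed by the cgroup/v2 plugin (assumes a dnf-based base image)
    $ dnf install -y dbus-devel
    # reconfigure and rebuild Slurm from its source tree
    $ ./configure --prefix=/usr --sysconfdir=/etc/slurm --libdir=/usr/lib64
    $ make -j"$(nproc)" && make install
    # the plugin should now be present alongside cgroup_v1.so
    $ ls /usr/lib64/slurm/cgroup_v2.so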


akram commented Aug 29, 2024

After adding the cgroup_v2.so plugin, I am getting:

slurmd: error: cgroup_dbus_attach_to_scope: cannot connect to dbus system daemon: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
slurmd: error: _init_new_scope_dbus: scope and/or cgroup directory for slurmstepd could not be set.
slurmd: error: cannot initialize cgroup directory for stepds: if the scope /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-poda8507382_5d59_4c0a_8119_c6651b541881.slice/system.slice/slurmstepd.scope already exists it means the associated cgroup directories disappeared and the scope entered in a failed state. You should investigate why the scope lost its cgroup directories and possibly use the 'systemd reset-failed' command to fix this inconsistent systemd state.
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed

It is probably better to disable cgroups via configuration.
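
A sketch of what that could look like in slurm.conf, assuming no cgroup-based resource enforcement is needed inside the containers (the values below are one possible combination, not the project's shipped configuration):

    # slurm.conf: steer slurmd away from the cgroup-based plugins
    ProctrackType=proctrack/linuxproc
    TaskPlugin=task/none
    JobAcctGatherType=jobacct_gather/linux

This avoids loading any cgroup plugin, at the cost of losing cgroup-based containment of job processes. Depending on the Slurm version, cgroup.conf may also accept CgroupPlugin=disabled; check the cgroup.conf man page for the release in the image.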

pong1013 commented

Hi, I faced the same issue when setting up a k3s cluster on GCP. The problem stems from the cgroup version on the VM, which was running Debian. Here’s how I solved it:

  1. Identify the cgroup version in use:

    $ mount | grep cgroup
    cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
  2. The VM defaulted to cgroup v2, so I had to switch it back to cgroup v1. To do this, edit the GRUB configuration:

    $ sudo nano /etc/default/grub

    Append the following parameter to the GRUB_CMDLINE_LINUX line so the system boots with the cgroup v1 hierarchy:

    GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=0"
    

    Update the GRUB configuration:

    $ sudo update-grub
    $ sudo reboot
  3. After reboot, verify the cgroup version again:

    $ mount | grep cgroup
    cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
    cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
  4. Install NFS on Slurm client nodes:

    sudo apt-get update
    sudo apt-get install -y nfs-common

Once these steps were complete, I was able to redeploy the repository, and the Slurm daemon pods started running successfully. This should help you resolve the issue!
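
If you prefer not to rely on autodetection after switching the host back to cgroup v1, the plugin can also be pinned explicitly in cgroup.conf (optional; in recent Slurm releases CgroupPlugin defaults to autodetect):

    # cgroup.conf: match the image, which only ships cgroup_v1.so
    CgroupPlugin=cgroup/v1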
