A Puppet module designed to configure and manage SLURM (see https://slurm.schedmd.com/), an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

ULHPC/puppet-slurm

Slurm Puppet Module

Configure and manage Slurm: A Highly Scalable Resource Manager

  Copyright (c) 2017-2021 UL HPC Team <[email protected]>
  .             see also http://hpc.uni.lu

Overview

Slurm (aka "Simple Linux Utility for Resource Management") is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters (~60% of Top500 rely on it).

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

It provides three key functions:

  1. It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
  2. It provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes.
  3. It arbitrates contention for resources by managing a queue of pending jobs.

This Puppet module is designed to configure and manage the different daemons and components of a typical Slurm architecture, depicted below:

In particular, this module implements the following elements:

| Puppet Class | Description |
|--------------|-------------|
| slurm | The main slurm class, piloting all aspects of the configuration |
| slurm::slurmdbd | Specialized class for Slurmdbd, the Slurm Database Daemon. |
| slurm::slurmctld | Specialized class for Slurmctld, the central management daemon of Slurm. |
| slurm::slurmd | Specialized class for Slurmd, the compute node daemon for Slurm. |
| slurm::login | Specialized class to configure a Login node (i.e. without any of the slurm daemons) |
| slurm::munge | Manages MUNGE, an authentication service for creating and validating credentials. |
| slurm::pam | Handles PAM aspects for SLURM (Memlock for MPI etc.) |
| slurm::params | Default parameters for all the module classes/definitions |
| slurm::plugins | Handles all default Slurm plugins -- NOT YET IMPLEMENTED |
| slurm::pmix | Handles PMIx aspects (download, build and installation) to make the SLURM build compliant with PMIx, PMI1 and PMI2 |
| slurm::repo | Takes care of the control repository hosting the slurm configuration of the cluster |

| Puppet Defines | Description |
|----------------|-------------|
| slurm::acct::mgr | Generic wrapper for all sacctmgr commands |
| slurm::acct::{account,cluster,qos,user} | Adds (or removes) an {account,cluster,qos,user} to (or from) the slurm accounting database |
| slurm::build | Builds Slurm sources into packages (i.e. RPMs for the moment) for a given version passed as resource name |
| slurm::download | Takes care of downloading the SLURM sources for a given version passed as resource name |
| slurm::firewall | Takes care of firewall aspects for SLURM |
| slurm::install::packages | Installs the Slurm packages, typically built from slurm::build, for a given version passed as resource name. |
| slurm::pmix::{download,build,install} | Download, build and install PMIx |
| slurm::repo::syncto | Synchronizes the content of the slurm control repository (see slurm::repo) toward a directory (typically a shared mountpoint) |

In addition, this puppet module implements several private classes:

Also, a couple of extra definitions are used in our infrastructure:

  • slurm::repo::syncto: synchronize the control repository of the slurm configuration (which is cloned using the 'slurm::repo' class) toward a directory (typically a shared GPFS/NFS mountpoint to make it available to all login and compute nodes)

All these components are configured through a set of variables you will find in manifests/params.pp.

Note: the various operations that can be conducted from this repository are piloted from a Rakefile and assume you have a running Ruby installation. See docs/contributing.md for more details on the steps to follow to get this Rakefile working properly.

IMPORTANT: Until the release of version 1.0 (denoting production usage on the UL HPC Platform), this module is still to be considered in alpha state and a work in progress. Use it at your own risk!

Setup Requirements

This module currently only works completely on Red Hat / CentOS 7 with Puppet >= 4.x. Other operating systems and Puppet 5.x and above seem to work, but support is not guaranteed. Feel free to contribute to this module to help us extend its usage.

By default, the module makes some key configuration decisions, namely:

  • MUNGE is used for shared key authentication.
    • the shared key is generated by default, but you probably want to provide it to puppet via a URI.
  • None of the daemons are configured by default.
    • You have to set the boolean parameter(s) with_{slurmdbd,slurmctld,slurmd} to true and/or include explicitly the slurm::{slurmdbd,slurmctld,slurmd} classes
  • On a production system, you probably want to follow these tips:
    • set globally service_manage to false - you probably want to control when the daemons are restarted (typically to ensure the slurm config is really in sync across the cluster)

    • set global do_package_install to false

    • maintain a single source of authority for the shared slurm configuration as a Git repository (called here the "slurm control repository")

      • your slurm controller(s) servers would then rely on the slurm::repo* classes/definitions to maintain the consistency between the local config (in /etc/slurm, as generated by this module upon puppet runs and its extensive hiera configuration capabilities) and this repository.
      • your login and compute nodes not correlate the local /etc/slurm directory with
    • you are advised to set the service_manage to false
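
The production tips above could be expressed as follows. This is only a sketch: it assumes that service_manage and do_package_install are exposed as parameters of the main slurm class, which the list suggests but does not state explicitly. Check manifests/init.pp before relying on it.

```puppet
# Sketch only: assumes service_manage and do_package_install are parameters
# of the main slurm class (suggested, but not confirmed, by the tips above).
class { '::slurm':
  service_manage     => false,  # daemons are (re)started by the admin, not by Puppet
  do_package_install => false,  # packages are installed outside of this Puppet run
}
```

The same two values could equally be set through Hiera when the class is declared with a plain include.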

Forge Module Dependencies

See metadata.json. In particular, this module depends on

Overview and Usage

The best way to use this module in a flexible way is to rely on Hiera coupled with the roles and profiles pattern.

The main classes are now detailed.

Class slurm

This is the main class defined in this module. It accepts so many parameters that they are not listed here -- see the [puppet strings @param] comments of manifests/init.pp. Use it as follows:

include ::slurm

In that case, you can define the class parameters using Hiera -- see for instance the default hiera configuration (used effectively in the vagrant deployment) in hieradata/default.yaml.

You may instead prefer a profile-based approach -- see profiles::slurm as a sample profile (base) class used for slurm general settings.

Other usage examples are proposed in tests/init.pp, and a more advanced usage (defining the network topology, the computing nodes and the SLURM partitions) in tests/advanced.pp.
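
As an illustration of the role and profile approach, a minimal sketch might look like the following. The class names profile::slurm_controller and role::hpc_controller are hypothetical and are not part of this module; only the two included slurm classes come from it. The two classes are shown together for brevity, although Puppet style normally places each class in its own file.

```puppet
# Hypothetical profile: wraps the module classes needed on a controller node.
class profile::slurm_controller {
  include ::slurm             # class parameters are resolved from Hiera
  include ::slurm::slurmctld  # adds the central management daemon
}

# Hypothetical role: a node is assigned one role, built from profiles.
class role::hpc_controller {
  include ::profile::slurm_controller
}
```

With this layout, site-specific values stay in Hiera and the profile remains the only place that refers to the slurm classes directly.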

Class slurm::slurmdbd

This class is responsible for setting up a Slurm Database Daemon, which provides a secure enterprise-wide interface to a database for Slurm. In particular, it can run relatively independently of the other slurm daemon instances and thus is proposed as a separate independent class.

You can simply configure it as follows:

include ::slurm
include ::slurm::slurmdbd

Alternatively, you can use the with_slurmdbd parameter of the ::slurm class:

class { '::slurm':
    with_slurmdbd => true,
}

See also tests/slurmdbd.pp and the sample profile profiles::slurm::slurmdbd.

The slurm::slurmdbd class also accepts so many parameters that they are not listed here -- see the [puppet strings @param] comments of manifests/slurmdbd.pp for more details.

For a sample Hiera, see hieradata/default.yaml (effectively used in the vagrant-based deployment).

Class slurm::slurmctld

The main helper class specializing the main slurm class for setting up a Slurm Head node (where the slurmctld daemon runs).

include ::slurm
include ::slurm::slurmctld

Alternatively, you can use the with_slurmctld parameter of the ::slurm class:

class { '::slurm':
    with_slurmctld => true,
}

See also tests/slurmctld.pp and the sample profile profiles::slurm::slurmctld.
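
Since the with_* flags are independent booleans, they can presumably be combined on a single host. The sketch below assumes a small setup where one head node runs both the database daemon and the controller; whether that layout suits your cluster is a site decision, not something this module prescribes.

```puppet
# Sketch: a single head node running both slurmdbd and slurmctld.
# Only the documented with_* flags of the main class are used.
class { '::slurm':
  with_slurmdbd  => true,
  with_slurmctld => true,
}
```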

Class slurm::slurmd

The main helper class specializing the main slurm class for setting up a Slurm Compute node, i.e. where the slurmd daemon runs.

include ::slurm
include ::slurm::slurmd

Alternatively, you can use the with_slurmd parameter of the ::slurm class:

class { '::slurm':
    with_slurmd => true,
}

Class slurm::login

The main helper class specializing the main slurm class for setting up a Slurm Login node, i.e. where none of the slurm daemons run (yet the slurm CLI commands are installed via the slurm package).

include ::slurm
include ::slurm::login

See also tests/login_node.pp and the sample profile profiles::slurm::login.

Class slurm::munge

MUNGE (MUNGE Uid 'N' Gid Emporium) is an authentication service for creating and validating credentials. It is designed to be highly scalable for use in an HPC cluster environment. It allows a process to authenticate the UID and GID of another local or remote process within a group of hosts having common users and groups. These hosts form a security realm that is defined by a shared cryptographic key. Clients within this security realm can create and validate credentials without the use of root privileges, reserved ports, or platform-specific methods.

For more information, see https://github.com/dun/munge

The puppet class slurm::munge is thus responsible for setting up a working Munge environment to be used by the SLURM daemons -- see also https://slurm.schedmd.com/authplugins.html. Use it as follows:

include ::slurm::munge

Or, if you wish to provide the munge key using a Puppet URI:

class { '::slurm::munge':
    ensure     => 'present',
    key_source => "puppet:///modules/${myprofile}/munge.key",
}

If, as in the above example, the key is stored centrally in your control repository, you probably want to store it encrypted using git-crypt for instance.

The slurm::munge class accepts the following parameters:

  • ensure [String] Default: 'present'
    • Ensure the presence (or absence) of the Munge service
  • create_key [Boolean] Default: true
    • Whether or not to generate a new key if it does not exist
  • daemon_args [Array] Default: []
  • gid [Integer] Default: 992
    • GID of the munge group
  • key_content [String] Default: undef
    • The desired contents of the key file, as a string. This attribute is mutually exclusive with key_source.
  • key_filename [String] Default: '/etc/munge/munge.key'
    • The secret key filename
  • key_source [String] Default: undef
    • A source file, which will be copied into place on the local system. This attribute is mutually exclusive with key_content. The normal form of a puppet: URI is puppet:///modules/<MODULE NAME>/<FILE PATH>
  • uid [Integer] Default: 992
    • UID of the munge user

Note that the slurm class uses this class by default, since its manage_munge parameter defaults to true.

Definition slurm::download

This definition takes care of downloading the SLURM sources for a given version (passed as name to this resource) and placing them into $target directory. You can also invoke this definition with the full archive filename i.e. slurm-<version>.tar.bz2.

  • ensure [String] Default: present
    • Ensure the presence (or absence) of the downloaded sources
  • target [String] Default: /usr/local/src
    • Target directory for the downloaded sources
  • checksum_type [String] Default: md5
    • archive file checksum type (none|md5|sha1|sha2|sha256|sha384|sha512).
  • checksum_verify [Boolean] Default: false
    • whether checksum will be verified (true|false).
  • checksum [String] Default: ''
    • archive file checksum (match checksum_type)

Example: Downloading version 19.05.3-2 (latest at the time of writing) of SLURM

 slurm::download { '19.05.3-2':
    ensure        => 'present',
    checksum      => '6fe2c6196f089f6210d5ba79e99b0656f5a527b4',
    checksum_type => 'sha1',
    target        => '/usr/local/src/',
 }

Definition slurm::build

This definition takes care of building Slurm sources into RPMs using 'rpmbuild'. It expects the SLURM version to build as its resource name, and assumes the sources have already been downloaded using slurm::download.

  • ensure [String] Default: present
    • Ensure the presence (or absence) of building
  • srcdir [String] Default: /usr/local/src
    • Where the [downloaded] Slurm sources are located
  • dir [String] Default: /root/rpmbuild on redhat systems
    • Top directory of the sources builds (i.e. RPMs, debs...). For instance, built RPMs will be placed under ${dir}/RPMS/${::architecture}
  • with [Array] Default: [ 'lua', ... ]
  • without [Array] Default: []

Example: Building version 17.11.12 (latest at the time of writing) of SLURM

slurm::build { '17.11.12':
  ensure => 'present',
  srcdir => '/usr/local/src',
  dir    => '/root/rpmbuild',
  with   => [ 'lua', 'mysql', 'openssl' ]
}

Definition slurm::install::packages

This definition takes care of installing the Slurm packages, typically built from slurm::build, for a given version passed as resource name.

Example: installing slurmd packages in version 17.11.12:

slurm::install::packages { '17.11.12':
   ensure => 'present',
   pkgdir => "/root/rpmbuild/RPMS/${::architecture}",
   slurmd => true
}
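
The three definitions above are meant to be used together for a given version. The following sketch chains them with explicit ordering arrows; it only uses parameters documented in this README, and the package directory follows the ${dir}/RPMS/${::architecture} layout described for slurm::build. The $version variable is introduced here for illustration, and the module may well declare these dependencies internally already, in which case the final chain is redundant but harmless.

```puppet
# Version shared by the three resources (illustrative local variable).
$version = '17.11.12'

# 1. Fetch the sources.
slurm::download { $version:
  ensure => 'present',
  target => '/usr/local/src',
}

# 2. Build RPMs from the downloaded sources.
slurm::build { $version:
  ensure => 'present',
  srcdir => '/usr/local/src',
  dir    => '/root/rpmbuild',
}

# 3. Install the compute node packages from the build output.
slurm::install::packages { $version:
  ensure => 'present',
  pkgdir => "/root/rpmbuild/RPMS/${::architecture}",
  slurmd => true,
}

# Explicit ordering; the module may already declare these dependencies itself.
Slurm::Download[$version]
-> Slurm::Build[$version]
-> Slurm::Install::Packages[$version]
```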

Librarian-Puppet / R10K Setup

You can of course configure the slurm module in your Puppetfile to make it available with Librarian puppet or r10k by adding the following entry:

 # Modules from the Puppet Forge
 mod "ULHPC/slurm"

or, if you prefer to work on the git version:

 mod "ULHPC/slurm",
     :git => 'https://github.com/ULHPC/puppet-slurm',
     :ref => 'production'

Hiera

You can see examples of Hiera configurations for this module under tests/vagrant/puppet/hieradata.

Issues / Feature request

You can submit bugs, issues and feature requests using the slurm Puppet Module Tracker.

Developments / Contributing to the code

If you want to contribute to the code, you should be aware of the way this module is organized. These elements are detailed in docs/contributing.md.

You are more than welcome to contribute to its development by sending a pull request.

Puppet modules tests within a Vagrant box

The best way to test this module in a non-intrusive way is to rely on Vagrant. The Vagrantfile at the root of the repository pilots the provisioning of a virtual cluster, configuring Slurm through Vagrant's Puppet provisioner applied to this module.

$> vagrant status
Current machine states:

slurm-master   not created (virtualbox)
access         not created (virtualbox)
node-1         not created (virtualbox)
node-2         not created (virtualbox)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.

$> vagrant up
[...]
+--------------|--------------------------|---------|----------|------------|-------------------------------|-------------+
|                                    Puppet Testing infrastructure deployed on Vagrant                                    |
+--------------|--------------------------|---------|----------|------------|-------------------------------|-------------+
| Name         | Hostname                 | OS      | vCPU/RAM | Role       | Description                   | IP          |
+--------------|--------------------------|---------|----------|------------|-------------------------------|-------------+
| slurm-master | slurm-master.vagrant.dev | centos7 | 2/2048   | controller | Slurm Controller #1 (primary) | 10.10.1.11  |
| access       | access.vagrant.dev       | centos7 | 1/1024   | login      | Cluster frontend              | 10.10.1.2   |
| node-1       | node-1.vagrant.dev       | centos7 | 2/512    | node       | Computing Node #1             | 10.10.1.101 |
| node-2       | node-2.vagrant.dev       | centos7 | 2/512    | node       | Computing Node #2             | 10.10.1.102 |
+--------------|--------------------------|---------|----------|------------|-------------------------------|-------------+
- Virtual Puppet Testing infrastructure deployed deployed!

Note: it takes roughly 38 minutes to deploy the full cluster from scratch. So be patient ;)

You can then test modifications of each configuration in the hiera file tests/vagrant/puppet/custom.yaml and see the result by applying for instance:

$> vagrant provision --provision-with puppet slurm-master

See docs/vagrant.md for more details.

Online Documentation

Read the Docs (aka RTFD) hosts documentation for the open source community, and the slurm puppet module has its documentation (see the docs/ directory) hosted on readthedocs.

See docs/rtfd.md for more details.

Licence

This project and the sources proposed within this repository are released under the terms of the Apache-2.0 licence.
