👌 IMPROVE: Remove hostname setting
Instead, dynamically set the slurm configuration using the machine's existing hostname.
chrisjsewell committed Dec 9, 2020
1 parent 2f3177f commit d5bb11d
Showing 10 changed files with 17 additions and 89 deletions.
10 changes: 2 additions & 8 deletions README.md
@@ -9,10 +9,8 @@ An Ansible role that installs the [slurm](https://slurm.schedmd.com/) workload m
The role:

- Installs the slurm packages
- Sets the hostname to be that defined in `slurm_hostname`
- If `slurm_hostname_service: true` adds a service to set the hostname on VM start-up (required for cloud platforms)
- Sets up the slurm configuration (`/etc/slurm-llnl/slurm.conf`) to dynamically use the correct platform resources (#CPUs, etc), configuring one node and one partition.
- Adds a `slurm-resources` script and start-up service to automate the initiation of correct platform resources (required if creating a VM instance with different resources to the build VM image)
- Sets up the slurm configuration (`/etc/slurm-llnl/slurm.conf`) to dynamically use the correct platform resources (hostname, #CPUs, etc), configuring one node (named `$HOSTNAME`) and one partition (named by `slurm_partition_name`).
- Adds a `slurm-resources` script and start-up service to automate the initiation of correct platform resources (required if creating a VM image where instances may have different resources)
- Starts the slurm services.

To check the services are running (assuming systemd in use):
@@ -48,10 +46,6 @@ $ slurm-resources -e restart_on_change=true -e slurm_max_cpus=2

This will update the resources defined for the node, set the maximum CPUs for the partition to 2 (independent of the CPUs available on the node), and restart the slurm services with the updated configuration (if the configuration has changed).
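
For orientation, the two extra variables in the command above correspond to playbook variables that the generated configuration playbook (`templates/config-playbook.yml.j2`, shown further down in this diff) presumably consumes. A minimal sketch of the equivalent variable settings (names taken from this commit, values illustrative):

```yaml
# Illustrative only: the extra-vars passed to slurm-resources above,
# expressed as playbook variables.
restart_on_change: true   # restart the slurm services if slurm.conf changed
slurm_max_cpus: 2         # written to MaxCPUsPerNode for the partition
```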

**NOTE!**
It is important that the hostname is properly set on the machine, both with `hostname <HOSTNAME>` (which sets `/etc/hostname`) and in the `/etc/hosts` file, on the line with the IP address (e.g. in docker this line should read `172.x.x.x <HOSTNAME>`). `<HOSTNAME>` should be replaced with the actual hostname and should match the variable `slurm_hostname` (default value: `qmobile`).

## Installation

`ansible-galaxy install marvel-nccr.slurm`
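
A minimal playbook applying the role might look like the following sketch (the variable values are illustrative; `slurm_partition_name` and `slurm_resources_service_enabled` are the defaults from `defaults/main.yml` shown below):

```yaml
# site.yml -- illustrative sketch, not part of this commit
- hosts: all
  become: true
  vars:
    slurm_partition_name: jobs
    slurm_resources_service_enabled: true
  roles:
    - role: marvel-nccr.slurm
```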
7 changes: 0 additions & 7 deletions defaults/main.yml
@@ -1,15 +1,8 @@
slurm_hostname: qmobile
slurm_user: slurm
slurm_cluster_name: "{{ slurm_hostname }}"
slurm_partition_name: jobs

slurm_test_folder: "/tmp/slurm-tests"

# Adds a system service that forces the hostname on startup
# Note: This is e.g. necessary on cloud platforms like AWS
slurm_hostname_service: false
slurm_set_hostname: false

# Enables the slurm-resources system service that re-configures the slurm compute resources on startup
# This is necessary when preparing an image that can start on different hardware than it was built on
slurm_resources_service_enabled: true
5 changes: 0 additions & 5 deletions molecule/default/converge.yml
@@ -1,9 +1,4 @@
- hosts: all

vars:
- run_tests: true
- cloud_platform: docker
- slurm_hostname_service: true

roles:
- role: marvel-nccr.slurm
1 change: 1 addition & 0 deletions molecule/default/molecule.yml
@@ -33,3 +33,4 @@ provisioner:
all:
vars:
ansible_python_interpreter: /usr/bin/python3
run_tests: true
19 changes: 0 additions & 19 deletions tasks/hostname.yml

This file was deleted.

21 changes: 0 additions & 21 deletions tasks/hostname_service.yml

This file was deleted.

5 changes: 0 additions & 5 deletions tasks/main.yml
@@ -75,11 +75,6 @@
args:
creates: "/etc/munge/munge.key"

- import_tasks: hostname.yml

- include_tasks: hostname_service.yml
when: slurm_hostname_service

# munge should start before slurm daemons
# https://slurm.schedmd.com/quickstart_admin.html
- name: start munge service
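
The task bodies below this point are collapsed in this view. Assuming systemd and the standard Debian `slurm-llnl` service names, a plausible shape for the service-start tasks would be:

```yaml
# Sketch only -- the actual tasks are collapsed in the diff above.
- name: start munge service
  become: true
  service:
    name: munge
    state: started
    enabled: true

- name: start slurm services
  become: true
  service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop:
    - slurmctld
    - slurmd
```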
15 changes: 11 additions & 4 deletions templates/config-playbook.yml.j2
@@ -2,7 +2,6 @@
hosts: localhost

vars:
slurm_hostname: "{{ slurm_hostname }}"
slurm_partition_name: "{{ slurm_partition_name }}"
{% raw %}
slurm_conf_file: /etc/slurm-llnl/slurm.conf
@@ -15,14 +14,22 @@
- debug:
msg: "Run: {{ lookup('pipe', 'date +%Y-%m-%d-%H:%M:%S') }}"

- name: Update SLURM configuration
- name: "Update SLURM ControlMachine={{ ansible_hostname }}"
become: true
lineinfile:
dest: "{{ slurm_conf_file }}"
regexp: '^ControlMachine='
line: "ControlMachine={{ ansible_hostname }}"
state: present

- name: Update SLURM node configuration
become: true
blockinfile:
path: "{{ slurm_conf_file }}"
marker: "# {mark} ANSIBLE MANAGED NODES"
block: |
NodeName={{ slurm_hostname }} Sockets={{ ansible_processor_count }} CoresPerSocket={{ ansible_processor_cores }} ThreadsPerCore={{ ansible_processor_threads_per_core }} State=UNKNOWN
PartitionName={{ slurm_partition_name }} Nodes={{ slurm_hostname }} Default=YES MaxTime=INFINITE State=UP MaxNodes=1 MaxCPUsPerNode={{ slurm_max_cpus }}
NodeName={{ ansible_hostname }} Sockets={{ ansible_processor_count }} CoresPerSocket={{ ansible_processor_cores }} ThreadsPerCore={{ ansible_processor_threads_per_core }} State=UNKNOWN
PartitionName={{ slurm_partition_name }} Nodes={{ ansible_hostname }} Default=YES MaxTime=INFINITE State=UP MaxNodes=1 MaxCPUsPerNode={{ slurm_max_cpus }}
backup: yes
register: update

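
The rest of this template is collapsed, but the registered `update` result, together with the `restart_on_change` extra variable described in the README above, presumably drives a conditional restart of the services. A hedged sketch of such a step:

```yaml
# Illustrative sketch, not the template's actual wording.
- name: Restart slurm services if the configuration changed
  become: true
  service:
    name: "{{ item }}"
    state: restarted
  loop:
    - slurmctld
    - slurmd
  when: update.changed and (restart_on_change | default(false) | bool)
```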
17 changes: 0 additions & 17 deletions templates/fix-hostname

This file was deleted.

6 changes: 3 additions & 3 deletions templates/slurm.conf
@@ -2,7 +2,7 @@
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine={{ slurm_hostname }}
ControlMachine={{ ansible_hostname }}
#ControlAddr=
#
MailProg=/usr/sbin/sendmail
@@ -53,6 +53,6 @@ SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# BEGIN ANSIBLE MANAGED NODES
NodeName={{ slurm_hostname }} Sockets={{ ansible_processor_count }} CoresPerSocket={{ ansible_processor_cores }} ThreadsPerCore={{ ansible_processor_threads_per_core }} State=UNKNOWN
PartitionName={{ slurm_partition_name }} Nodes={{ slurm_hostname }} Default=YES MaxTime=INFINITE State=UP MaxNodes=1 MaxCPUsPerNode={{ ansible_processor_vcpus }}
NodeName={{ ansible_hostname }} Sockets={{ ansible_processor_count }} CoresPerSocket={{ ansible_processor_cores }} ThreadsPerCore={{ ansible_processor_threads_per_core }} State=UNKNOWN
PartitionName={{ slurm_partition_name }} Nodes={{ ansible_hostname }} Default=YES MaxTime=INFINITE State=UP MaxNodes=1 MaxCPUsPerNode={{ ansible_processor_vcpus }}
# END ANSIBLE MANAGED NODES
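
To see which gathered facts will be substituted into these lines on a particular machine, a small ad-hoc playbook along these lines can help (sketch only; it simply prints the facts referenced by the template):

```yaml
- hosts: localhost
  tasks:
    - name: Show the facts used by the slurm.conf template
      debug:
        msg: >-
          host={{ ansible_hostname }}
          sockets={{ ansible_processor_count }}
          cores_per_socket={{ ansible_processor_cores }}
          threads_per_core={{ ansible_processor_threads_per_core }}
          vcpus={{ ansible_processor_vcpus }}
```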
