From 9248b5f63f4950b76d3dfda1d0929b6dcac37293 Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Fri, 18 Aug 2023 13:49:01 +0000
Subject: [PATCH 1/6] update copywrite info

---
 LICENSE | 1 +
 1 file changed, 1 insertion(+)

diff --git a/LICENSE b/LICENSE
index 16bc43f..c16a265 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,7 @@
 MIT License
 
 Copyright (c) 2019 Giovanni Torres
+Copyright (c) 2023 StackHPC
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

From 96e45e4c0585979972a1ec8a95b809ba012a8053 Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Fri, 18 Aug 2023 13:49:56 +0000
Subject: [PATCH 2/6] Add features, limitations and acknowlegments to README

---
 README.md | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index c0b7d61..4dd3e70 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,16 @@
 # Slurm Docker Cluster
 
-This is a multi-container Slurm cluster using Kubernetes. The Slurm cluster Helm chart creates a named volume for persistent storage of MySQL data files. By default, it also installs the
-RookNFS Helm chart (also in this repo) to provide shared storage across the Slurm cluster nodes.
+A Helm chart and Dockerfile to run a multi-container Slurm cluster on Kubernetes, featuring:
+
+* Control, login, slurmd (worker), slurmdbd and mariadb pods.
+* A shared `/home` directory across the slurm pods, by default via an install of RookNFS to provide a storage class with Read Write Many (RWX) capabilities.
+* SSH and and HTTPS access to the login pod with an Open Ondemand web GUI.
+* A single slurmd pod per Kubernetes worker node with automatic definition of slurm node memory and CPU configuration.
+* Slurm jobs run inside the slurmd pods, using host networking for maximum MPI performance.
+* Open MPI installed with support for Slurm's `srun` launcher (via `pmix`) - see example below.
+* Support for containerised jobs via Apptainer - see example below.
+* Job accounting information retained across container upgrades via a persistent volume claim.
+* Credentials/secrets are generated during the Helm install, not embedded in images.
 
 ## Dependencies
 
@@ -178,4 +187,10 @@ and then restart the other dependent deployments to propagate changes:
 kubectl rollout restart deployment slurmd slurmctld login slurmdbd
 ```
 
-# Known Issues
+# Limitations and Known Issues
+- Only a single cluster should be deployed per Kubernetes namespace.
+- Only the `rocky` user is currently supported.
+
+# Acknowlegements
+
+Originally based on https://github.com/giovtorres/slurm-docker-cluster which defines a docker-compose -based cluster.

From 99a671b07a3adfd4346c6267d4833c524b21da8d Mon Sep 17 00:00:00 2001
From: Scott Davidson <49713135+sd109@users.noreply.github.com>
Date: Fri, 18 Aug 2023 14:56:47 +0100
Subject: [PATCH 3/6] Fix typo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4dd3e70..470de44 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ A Helm chart and Dockerfile to run a multi-container Slurm cluster on Kubernetes
 
 * Control, login, slurmd (worker), slurmdbd and mariadb pods.
 * A shared `/home` directory across the slurm pods, by default via an install of RookNFS to provide a storage class with Read Write Many (RWX) capabilities.
-* SSH and and HTTPS access to the login pod with an Open Ondemand web GUI.
+* SSH and HTTPS access to the login pod with an Open Ondemand web GUI.
 * A single slurmd pod per Kubernetes worker node with automatic definition of slurm node memory and CPU configuration.
 * Slurm jobs run inside the slurmd pods, using host networking for maximum MPI performance.
 * Open MPI installed with support for Slurm's `srun` launcher (via `pmix`) - see example below.
From b185ce7ff9676f07c3a747465d0ce6f68b15f703 Mon Sep 17 00:00:00 2001
From: Scott Davidson <49713135+sd109@users.noreply.github.com>
Date: Fri, 18 Aug 2023 14:56:56 +0100
Subject: [PATCH 4/6] Fix typo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 470de44..8daf9f0 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 A Helm chart and Dockerfile to run a multi-container Slurm cluster on Kubernetes, featuring:
 
 * Control, login, slurmd (worker), slurmdbd and mariadb pods.
-* A shared `/home` directory across the slurm pods, by default via an install of RookNFS to provide a storage class with Read Write Many (RWX) capabilities.
+* A shared `/home` directory across the slurm pods, by default via an install of RookNFS to provide a storage class with Read Write Many (RWX) capabilities.
 * SSH and HTTPS access to the login pod with an Open Ondemand web GUI.
 * A single slurmd pod per Kubernetes worker node with automatic definition of slurm node memory and CPU configuration.
 * Slurm jobs run inside the slurmd pods, using host networking for maximum MPI performance.
From 1ccca695982da0441f79c652740ef26df2b8efe8 Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Fri, 18 Aug 2023 14:04:10 +0000
Subject: [PATCH 5/6] update README title in line with repo name change

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4dd3e70..311faba 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Slurm Docker Cluster
+# Slurm Kubernetes Cluster
 
 A Helm chart and Dockerfile to run a multi-container Slurm cluster on Kubernetes, featuring:
 

From 848b34d22661b2cd282de549b3bbdd59c0b6a0f5 Mon Sep 17 00:00:00 2001
From: wtripp180901 <78219569+wtripp180901@users.noreply.github.com>
Date: Mon, 4 Sep 2023 17:10:32 +0100
Subject: [PATCH 6/6] Removed references to generate-secrets.sh

---
 README.md | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/README.md b/README.md
index ba395af..7b91d9c 100644
--- a/README.md
+++ b/README.md
@@ -43,16 +43,6 @@ All config files in `slurm-cluster-chart/files` will be mounted into the contain
 
 ## Deploying the Cluster
 
-### Generating Cluster Secrets
-
-On initial deployment ONLY, run
-```console
-./generate-secrets.sh []
-```
-This generates a set of secrets in the target namespace to be used by the Slurm cluster. If these need to be regenerated, see "Reconfiguring the Cluster"
-
-Be sure to take note of the Open Ondemand credentials, you will need them to access the cluster through a browser
-
 ### Connecting RWX Volume
 
 A ReadWriteMany (RWX) volume is required for shared storage across cluster nodes. By default, the Rook NFS Helm chart is installed as a dependency of the Slurm cluster chart in order to provide a RWX capable Storage Class for the required shared volume. If the target Kubernetes cluster has an existing storage class which should be used instead, then `storageClass` in `values.yaml` should be set to the name of this existing class and the RookNFS dependency should be disabled by setting `rooknfs.enabled = false`. In either case, the storage capacity of the provisioned RWX volume can be configured by setting the value of `storage.capacity`.
@@ -172,10 +162,6 @@ Generally restarts to `slurmd`, `slurmctld`, `login` and `slurmdbd` will be requ
 
 ### Changes to secrets
 
-Regenerate secrets by rerunning
-```console
-./generate-secrets.sh
-```
 Some secrets are persisted in volumes, so cycling them requires a full teardown and reboot of the volumes and pods which these volumes are mounted on. Run
 ```console
 kubectl delete deployment mysql