22.04.1
DeepOps 22.04.1 Release Notes
Bugfix release
NVIDIA rotated the signing keys for the CUDA repositories on April 27, 2022, breaking installs from DeepOps 22.04, which had been released a few days earlier.
This release, 22.04.1, starts from 22.04 and adds PR #1167 to handle the updated key.
The previous release notes from 22.04 appear below.
Known Issues
- Kubeflow deployment is currently broken due to an incompatibility between the current Kubeflow release and Kubernetes 1.22. The Kubeflow deployment will be updated to add support once Kubeflow 1.6 is released. See #1147.
General
- Extensive improvements to automated testing with Jenkins, Ansible Molecule, and ansible-lint
- Update MIG playbook to use the new nvidia-mig-manager systemd service
- Updates to roles for nvidia-docker and GPU driver
- Various bug fixes
Slurm
- Enhanced NCCL tests for Slurm cluster validation
- Make the use of pam_slurm_adopt optional
- Break out multiple sections in Slurm inventory file
Kubernetes
- Update to Kubernetes 1.22.6
- Update default container runtime from dockershim to containerd
- Add support for NVIDIA Network Operator
- Add support to deploy NVIDIA Deep Learning Examples on Kubernetes clusters
- Update to GPU Operator 1.9
Changes
Bugs/Enhancements
- Fixes for rsyslog server role (#1096, #1098)
- Update NetApp Trident default version number and branding (#1105)
- Introduce a common script library (#953)
- Update versions of monitoring stack components (#1107)
- Updates to Jenkins testing (#1112, #1127, #1133, #1137, #1138, #1139, #1150, #1151)
- Fixes for setup script (#1114)
- Automated testing of DeepOps roles using Molecule (#1094, #1116, #1158)
- Update nvidia.nvidia_docker role to v1.2.4 (#1121)
- Automated deployment of Deep Learning Examples (#1083, #1145)
- Make it optional to use pam_slurm_adopt (#1111)
- Convert MIG playbook to use nvidia-mig-manager service (#1106)
- Update to GPU Operator 1.9 (#1074)
- Automatically run ansible-lint on each role (#1129)
- Update Kubeflow deployment script to Kubeflow 1.4 (#1104)
- Remove old build dirs during Slurm upgrade (#1101)
- Fixes to ood-wrapper role (#1125)
- Documentation of network ports (#1126)
- Set missing defaults in playbooks (#1134)
- Update to Kubespray v2.18.1 and containerd (#1043, #1141)
- Fix GPU Operator config (#1136)
- Break out functional host groups in Slurm inventory (#1087)
- Fix ordering in k8s cluster deployment (#1128)
- Update nvidia.nvidia_driver role to v2.2.0 (#1143, #1160)
- Add support for NVIDIA Network Operator (#1113, #1156)
- Enhanced NCCL tests for Slurm validation (#1042)
- Fix git.io shortlinks (#1163)
- Check for SELinux disabled in SELinux tasks (#1162)
Upgrade Steps
If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 22.01, run:
    git diff 22.01 22.04 -- config.example/
If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
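
As a complement to reading the git diff, new variables in the config.example files can also be spotted programmatically. The following is a minimal Python sketch and is not part of DeepOps. It assumes that the existing configuration lives in a directory that mirrors the layout of config.example/ (for example config/), and it only looks at uncommented, top-level YAML keys. The function names top_level_keys, missing_variables, and report are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

import re
from pathlib import Path

# Matches an uncommented, unindented "name:" at the start of a line.
# Assumption: variables of interest are top-level keys; nested keys and
# commented-out defaults are deliberately ignored.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)[ \t]*:", re.MULTILINE)


def top_level_keys(yaml_text: str) -> set[str]:
    """Return the names of the top-level variables set in a YAML document."""
    return set(TOP_LEVEL_KEY.findall(yaml_text))


def missing_variables(example_text: str, existing_text: str) -> list[str]:
    """List variables set in the example file but absent from the existing one."""
    return sorted(top_level_keys(example_text) - top_level_keys(existing_text))


def report(example_dir: Path, config_dir: Path) -> dict[str, list[str]]:
    """Compare each *.yml file under example_dir with its counterpart in config_dir.

    A file that does not exist in config_dir is treated as empty, so all of
    its example variables are reported as missing.
    """
    result: dict[str, list[str]] = {}
    for example_file in sorted(example_dir.rglob("*.yml")):
        relative = example_file.relative_to(example_dir)
        existing_file = config_dir / relative
        existing_text = existing_file.read_text() if existing_file.exists() else ""
        missing = missing_variables(example_file.read_text(), existing_text)
        if missing:
            result[str(relative)] = missing
    return result
```

Calling report(Path("config.example"), Path("config")) from the repository root would return a mapping from each file to the variable names it lacks. Because the sketch uses a regular expression rather than a YAML parser, treat its output as a hint and confirm against the git diff.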