21.03, released by supertetelman on 11 Mar 22:33
DeepOps 21.03 Release Notes
What's New
General
- Rsyslog client/server for K8s & Slurm deployments
- Examples for running Ansible and configuring Inventory file
- Improved support for Ubuntu 20.04 and CentOS 8
- Docker login convenience playbook
- Marked air-gap as "experimental"
- Vagrant/virtual 2.2.14 (previously 2.2.3)
Slurm
- Slurm version 20.11.3 (previously 20.02.4)
- HPC SDK 21.2 (previously 2020_207)
K8s
- Helm version v3.4.1 (previously v3.1.2)
- NFS Client Provisioner as K8s Default StorageClass
- GPU Operator v1.5.2 (previously v1.1.7)
- GPU Device Plugin v0.8.2 (previously v0.7.0)
- GPU Feature Discovery v0.4.1 (previously v0.2.0)
- Example NGC Dockerfiles bumped to 20.12 with improved documentation
- New example yaml files for launching single node/multi node training and jupyter notebooks
- RoCE performance playbook
Changes
- Deprecation of Rook-Ceph deployment script
- Removed default MPI Operator install for K8s
- NFS server is now deployed on kube-master[0] by default with path /export/deepops_nfs
- New log bundling tool (debug.sh) for K8s
- Enroot marked as "not fully automated" for CentOS (simple workaround is to bump enroot Ansible Galaxy role from v0.3.2 to v0.4.0 and re-run setup.sh)
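The enroot workaround above amounts to editing one version pin and re-running setup. A minimal sketch follows, assuming the Galaxy roles are pinned in a YAML requirements file with the role name and its version on separate lines. The helper name bump_role_version, the role name nvidia.enroot, and the path roles/requirements.yml are illustrative assumptions, not confirmed by these notes; check your checkout before using them.

```shell
#!/bin/sh
# bump_role_version FILE ROLE OLD NEW
# Hypothetical helper: replaces OLD with NEW only between the line that
# mentions ROLE and the next "version:" line, so other roles pinned to the
# same version are left alone.
# Note: ROLE and OLD are used as sed patterns, so "." matches any character.
bump_role_version() {
    file="$1"
    role="$2"
    old="$3"
    new="$4"
    tmp="$(mktemp)" || return 1
    # Write the edited copy to a temp file, then copy it back so the
    # original file keeps its permissions.
    if sed "/$role/,/version:/ s/$old/$new/" "$file" > "$tmp"; then
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}
```

Usage, under the same assumptions: `bump_role_version roles/requirements.yml nvidia.enroot v0.3.2 v0.4.0`, followed by `./scripts/setup.sh`.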
Bugs/Enhancements
- K8s monitoring metrics now persist by default using NFS-backed PVs.
- Additional testing for Ubuntu 20.04, CentOS 8, GPU Operator, enroot, mpi, and testing.md
- Addressed firewall issues in CentOS
- Add vGPU support for GPU Operator installs
- Address intermittent download failures in Slurm install
Upgrade steps
If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 20.12, run git diff 20.12 21.03 -- config.example/. Note that the majority of the config changes relate to new functionality such as nfs-client-provisioner, rsyslog, and persistent monitoring metrics in K8s. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
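As a complement to the git diff between release tags, the sketch below compares an example config file against your existing one and lists variables that are missing locally. It is a rough aid, not part of DeepOps: the function names list_keys and missing_vars are hypothetical, only uncommented top-level YAML keys are detected, and the file paths in the usage note are assumptions about the repository layout.

```shell
#!/bin/sh
# list_keys FILE
# Prints the top-level YAML keys in FILE: an identifier starting at
# column 0 and followed by a colon. Commented-out and nested keys are skipped.
list_keys() {
    grep -oE '^[A-Za-z_][A-Za-z0-9_]*:' "$1" | tr -d ':' | sort -u
}

# missing_vars EXAMPLE_FILE EXISTING_FILE
# Prints keys that appear in EXAMPLE_FILE but not in EXISTING_FILE.
missing_vars() {
    example_file="$1"
    existing_file="$2"
    tmp_example="$(mktemp)" || return 1
    tmp_existing="$(mktemp)" || return 1
    list_keys "$example_file" > "$tmp_example"
    list_keys "$existing_file" > "$tmp_existing"
    # comm -23 keeps lines that are unique to the first sorted input.
    comm -23 "$tmp_example" "$tmp_existing"
    rm -f "$tmp_example" "$tmp_existing"
}
```

For example, `missing_vars config.example/group_vars/all.yml config/group_vars/all.yml` would print one variable name per line, assuming both files exist at those paths. Each name still needs to be looked up in the example file for its default value and comments.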