kalenpeterson/dgx-setup

DGX-Lambda Node Setup

This document details the procedure for preparing DGX/Lambda nodes to be added to a cluster.

Related Repos

These related repos contain their own documentation and tooling.

| Git Repository | Description |
|---|---|
| kube-slurm | Tools for integrating Slurm and Kubernetes |
| dgx-chargeback | Tools to manage GPU Chargeback of kube-slurm clusters |

Document Index

| Document | Description | Version Info |
|---|---|---|
| New Cluster | Guide to deploying a new Kube Cluster | Kalen Peterson, Dec 2020 |
| Add Node | Guide to adding DGX or Lambda nodes to the Cluster | Kalen Peterson, June 2023 |
| Repair Cluster | Guide to repair/re-add a broken Master node in the cluster | Kalen Peterson, April 2022 |

Tool Index

| Tool | Description | Version Info |
|---|---|---|
| cluster-setup.yaml | Ansible playbook to configure base cluster nodes (master/node) | Kalen Peterson, June 2023 |
| configure-firewall.yaml | Ansible playbook to configure firewall on master nodes | Kalen Peterson, April 2021 |
| renew-cluster-certs.yaml | Ansible playbook to renew Kubernetes cluster certificates | Kalen Peterson, April 2021 |
| setup-kubectl.yaml | Ansible playbook to configure kubectl and distribute kubeconfig | Kalen Peterson, April 2021 |
| restart_cluster_services.sh | Shell script to restart all Kube/Slurm services | Kalen Peterson, June 2023 |
| podman_reset.sh | Shell script to set up/reset a user's Podman configuration | Kalen Peterson, April 2022 |
| cluster-repair.yaml | Ansible playbook to repair/re-add broken master nodes | Kalen Peterson, April 2022 |

References

| URL | Description |
|---|---|
| https://github.com/NVIDIA/deepops | Nvidia deepops project, source for developing this cluster |
