Kubeflow is a Kubernetes-native tool that eases the Deep Learning and Machine Learning lifecycle.
Kubeflow allows users to request specific resources (such as the number of GPUs and CPUs), specify Docker images, and easily launch and develop models through Jupyter notebooks. Kubeflow makes it easy to create persistent home directories, mount data volumes, and share notebooks within a team.
Kubeflow also offers a full deep learning pipeline platform that allows you to run, track, and version experiments. Pipelines can be used to deploy code to production and can include every step in the training process (data prep, training, tuning, etc.), each run from a different Docker image. For some examples, reference the examples directory.
Additionally, Kubeflow offers hyperparameter tuning options.
Kubeflow is an open source project and is regularly evolving and adding new features.
As part of the Kubeflow installation, the MPI Operator will also be installed. This will add the MPIJob CustomResourceDefinition to the cluster, enabling multi-pod or multi-node workloads. See here for details and examples.
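For reference, an MPIJob is submitted like any other Kubernetes resource. The sketch below is illustrative only: the apiVersion, image name, replica counts, and training command are assumptions and should be checked against the MPI Operator version and examples installed in your cluster.

# Illustrative MPIJob sketch; older MPI Operator releases use kubeflow.org/v1alpha2 instead of v1
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: example-mpijob
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: my-registry/my-mpi-image:latest   # hypothetical image
            command: ["mpirun", "-np", "2", "python", "train.py"]   # hypothetical command
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: my-registry/my-mpi-image:latest   # hypothetical image
            resources:
              limits:
                nvidia.com/gpu: 1
EOF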
Deploy Kubernetes by following the DeepOps Kubernetes Deployment Guide
Kubeflow requires a DefaultStorageClass to be defined. By default, DeepOps installs the nfs-client-provisioner using the nfs-client-provisioner.yml playbook. This playbook can be re-run manually. As an alternative to NFS, Ceph, Trident, or another StorageClass can be used.
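To verify that a default StorageClass exists, or to mark one as default, standard kubectl commands can be used. The class name nfs-client below is the one created by the DeepOps playbook and may differ in your cluster.

# List StorageClasses; the default is marked "(default)"
kubectl get storageclass

# Mark a StorageClass as default (replace nfs-client if your class has a different name)
kubectl patch storageclass nfs-client -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'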
Deploy Kubeflow:
# Deploy (using istio configuration)
./scripts/k8s/deploy_kubeflow.sh
Deploy Kubeflow with Dex and SSO integration:
# Deploy (using istio_dex configuration)
./scripts/k8s/deploy_kubeflow.sh -x
See the install docs for additional install configuration options.
Kubeflow configuration files will be saved to ./config/kubeflow-install.
The kfctl binary will be saved to ./config/kfctl. For easier management this file can be copied to /usr/local/bin or added to the PATH.
The services can be reached from the following address:
- Kubeflow: http://<kube-master>:31380
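To confirm the address, look up a node IP and the NodePort exposed by the Istio ingress gateway; the namespace and service name below assume a standard Kubeflow Istio install.

# List node IPs for the cluster
kubectl get nodes -o wide

# Show the NodePort exposed by the Istio ingress gateway (31380 by default)
kubectl get svc -n istio-system istio-ingressgateway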
The default username is [email protected] and the default password is 12341234.
These can be modified at startup time following the steps outlined here.
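If the multi-user (Dex) configuration was deployed, the credentials can also be inspected after installation. The namespace and ConfigMap name below (auth and dex) are assumptions based on common Kubeflow Dex installs and may differ between Kubeflow versions.

# Inspect the Dex configuration (staticPasswords holds the bcrypt-hashed password)
kubectl get configmap dex -n auth -o yaml

# After editing the ConfigMap, restart Dex so it picks up the change
kubectl rollout restart deployment dex -n auth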
For the most up-to-date usage information, run ./scripts/k8s/deploy_kubeflow.sh -h.
./scripts/k8s/deploy_kubeflow.sh -h
Usage:
-h This message.
-p Print out the connection info for Kubeflow.
-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow).
-x Install Kubeflow with multi-user auth (this utilizes Dex, the default is no multi-user auth).
-c Specify a different Kubeflow config to install with (this option is deprecated).
-w Wait for Kubeflow homepage to respond (also polls for various Kubeflow Deployments to have an available status).
To uninstall and re-install Kubeflow run:
./scripts/k8s/deploy_kubeflow.sh -d
./scripts/k8s/deploy_kubeflow.sh
To modify the Kubeflow configuration, modify the downloaded CONFIG YAML file in config/kubeflow-install/ or one of the many overlay YAML files in config/kubeflow-install/kustomize.
After modifying the configuration, apply the changes to the cluster using kfctl:
cd config/kubeflow-install
../kfctl apply -f kfctl_k8s_istio.yaml
A common issue with Kubeflow installation is that no DefaultStorageClass has been defined or that Ceph has not been deployed correctly.
This can be identified when most of the Kubeflow Pods are running but the MySQL Pod and several others remain in a Pending state. The GUI may also load and throw a "Profile Error". Run the following to debug further:
kubectl get pods -n kubeflow
NOTE: Everything should be in a running state.
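If any Pods are stuck in Pending, their events and the state of their PersistentVolumeClaims usually point to the storage problem. The commands below are generic debugging steps; replace the placeholder with an actual Pod name.

# Show why a Pending Pod has not been scheduled (replace <pod-name>)
kubectl describe pod -n kubeflow <pod-name>

# Check whether the PersistentVolumeClaims are Bound or stuck Pending
kubectl get pvc -n kubeflow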
If nfs-client-provisioner was used as the default StorageClass, verify it is running and set as default:
helm list | grep nfs-client
kubectl get storageclass | grep default
NOTE: If NFS is being used, the helm application should be in a deployed state and nfs-client should be the default StorageClass.
If Ceph was installed, verify it is running:
./scripts/k8s/deploy_rook.sh -w
kubectl get storageclass | grep default
NOTE: If Ceph is being used, deploy_rook.sh -w should exit after several seconds and Ceph should be the default StorageClass.
To correct this issue:
- Uninstall Rook/Ceph:
./scripts/k8s/deploy_rook.sh -d
- Uninstall Kubeflow:
./scripts/k8s/deploy_kubeflow.sh -d
- Re-install Rook/Ceph:
./scripts/k8s/deploy_rook.sh
- Poll for Ceph to initialize (wait for this script to exit):
./scripts/k8s/deploy_rook.sh -w
- Re-install Kubeflow:
./scripts/k8s/deploy_kubeflow.sh
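After the re-install, the -w flag can be used to wait for Kubeflow to become available, and the Pods can be checked again:

# Wait for the Kubeflow homepage and Deployments to become available
./scripts/k8s/deploy_kubeflow.sh -w

# Confirm all Pods reach a Running state
kubectl get pods -n kubeflow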