This use case uses OpenShift Virtualization to implement an end-to-end AI workflow on OpenShift. To do so, we're replicating the Manuela Visual Inspection demo that was developed by Stefan Bergstein.
- We're following the Manuela Visual Inspection demo instructions (slides).
- We're using an Equinix Metal OpenShift cluster deployed through RHDP (OpenShift AIO) for the CNV / data labelling use case. Its specs are:
  - 3 controller nodes: 8 cores / 24 GB memory
  - 3 worker nodes: 16 cores / 50 GB memory
- We're using an SNO instance deployed on a GPU node for the model training and serving use case. The node specs are:
  - 24 cores / 128 GB memory
  - NVIDIA Tesla M60 GPU
CVAT is the Computer Vision Annotation Tool maintained by the OpenCV community. It lets users annotate images through a graphical user interface. In the demo it's used to showcase data labeling in a visual inspection use case.
CVAT is a containerized application. However, it's not compatible with Kubernetes; it's intended to be deployed through docker-compose. To still run it on OCP, next to the ML development and deployment environment, we're using OpenShift Virtualization to host CVAT in an OCP-based VM.
- Provision OpenShift AIO (Equinix Metal) with OpenShift Virtualization Lab RHDP item.
- Provision the virtualization cluster.
- SSH into the bastion host.
- Download `virtctl` to the bastion host through the OCP cluster (check out Command Line Tools).
  - As we don't have a valid CA cert, we need to use `wget --no-check-certificate` to handle `https`.
  - Unpack the archive and copy the binary to `/usr/local/bin`.
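The download steps above can be sketched as a small shell function. This is a minimal sketch, not the demo's own script: `install_virtctl` is a hypothetical helper name, the URL is a placeholder that has to be copied from the cluster's Command Line Tools page, and the archive is assumed to unpack a top-level `virtctl` binary.

```shell
# Sketch: fetch virtctl from the cluster's download link and install it.
# The URL passed as $1 is a placeholder; copy the real link from the
# "Command Line Tools" page of your cluster.
install_virtctl() {
  url="$1"
  workdir="$(mktemp -d)"
  # The cluster serves the download without a valid CA cert,
  # so certificate verification is skipped.
  wget --no-check-certificate -O "$workdir/virtctl.tar.gz" "$url" || return 1
  tar -xzf "$workdir/virtctl.tar.gz" -C "$workdir" || return 1
  # Assumption: the archive contains a top-level `virtctl` binary.
  sudo cp "$workdir/virtctl" /usr/local/bin/virtctl
}
```

On the bastion host this would be called as `install_virtctl "<virtctl download URL>"`.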
Follow the demo setup instructions. Additional notes below.
- Create project `hackathon`.
- Provision a VM:
  - CentOS Stream 8
  - project `hackathon`
  - customize the deployment: 2 CPU, 8 GB RAM
  - use PVC as per template
Follow the demo setup instructions. Additional notes below.
- Install `docker-compose`.
- Clone the CVAT repository.
  - NOTE: Check out version 1.4.0, as more recent versions are deployed differently than documented in the demo instructions.
  - Add image tag `v1.4.0` to the CVAT service container entries.
- Start CVAT through `docker-compose`.
- Expose the service on the bastion host using `virtctl`.
  - Ensure the VM name is correct; replace it if necessary.
- Expose the service externally using a `route`.
- Log into the app admin console and add users. Proceed with the demo as documented.
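As a rough sketch of the two exposure steps above, the following shell function wraps the commands run on the bastion host. The function name `expose_cvat`, the service name `cvat`, and port `8080` are assumptions made for illustration; adjust them to match the demo instructions and your VM.

```shell
# Sketch: expose the CVAT VM as a Service, then publish it through a Route.
# Assumptions: service name "cvat" and CVAT listening on port 8080 in the VM.
expose_cvat() {
  vm_name="$1"        # name of the VM in the hackathon project
  port="${2:-8080}"   # assumed CVAT port; pass a second argument to override
  # Create a Service that forwards traffic to the VM.
  virtctl expose vm "$vm_name" --name cvat --port "$port" \
    --target-port "$port" -n hackathon || return 1
  # Create a Route so the Service is reachable from outside the cluster.
  oc expose service cvat -n hackathon
}
```

It would be called as `expose_cvat <vm-name>`, with the VM name taken from the `hackathon` project.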
Using the annotated image data, we're training a machine learning model that detects scratches and bends in metal nut images. The training runs within a JupyterLab instance in Red Hat OpenShift Data Science (RHODS). Running the training algorithm on a GPU node is recommended but not required.
- Deploying SNO on GPU node using Assisted Installer (with LVMS installation option).
- Installing Node Feature Discovery Operator.
  - Create `NodeFeatureDiscovery` instance: `nfd-instance`
  - Node Feature Discovery Operator uses vendor PCI IDs to identify hardware in our node. `0x10de` is the PCI vendor ID that is assigned to NVIDIA:

        $ oc describe node | egrep 'Roles|pci'
        Roles:              control-plane,master,worker
                            feature.node.kubernetes.io/pci-102b.present=true
                            feature.node.kubernetes.io/pci-10de.present=true
                            feature.node.kubernetes.io/pci-14e4.present=true

- Installing NVIDIA GPU Operator.
  - Create ClusterPolicy: `gpu-cluster-policy`
  - Both GPUs are detected:

        $ oc describe node | sed '/Capacity/,/System/!d;/System/d'
        Capacity:
          cpu:                24
          ephemeral-storage:  936104940Ki
          hugepages-1Gi:      0
          hugepages-2Mi:      0
          memory:             131556044Ki
          nvidia.com/gpu:     2
          pods:               250

- Installing Red Hat OpenShift Data Science Operator.
  - On platforms that do not provide shareable object storage, the OpenShift Image Registry Operator bootstraps itself as `Removed`. This allows `openshift-installer` to complete installations on these platform types. After installation, you must configure the storage and edit the Image Registry Operator configuration to switch the managementState from `Removed` to `Managed`.
    - Create PVC: `image-registry-storage`
    - Enable Image registry
  - Create Data Science Project: `hackathon-ds`
  - Create Workbench: `hackathon-wb` (`CUDA` image)
  - Create Cluster Storage: `hackathon-cs`
  - GPU: not shown (but the workbench will still use it). This may be a bug.
- Enter workbench.
- Clone Git Repository.
- Run notebook: `manuela-visual-inspection/ml/pytorch/Manuela_Visual_Inspection_Yolov5_Model_Training.ipynb`
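The image registry step in the list above can also be done from the command line. The following is a minimal sketch under a few assumptions: the PVC `image-registry-storage` already exists in the `openshift-image-registry` namespace, `oc` is logged in with cluster-admin rights, and `enable_image_registry` is a hypothetical helper name.

```shell
# Sketch: point the image registry at an existing PVC and switch it on.
# Assumes the PVC image-registry-storage already exists in the
# openshift-image-registry namespace.
enable_image_registry() {
  # A merge patch sets the storage claim and flips managementState
  # from Removed to Managed in a single update.
  oc patch configs.imageregistry.operator.openshift.io cluster \
    --type merge \
    --patch '{"spec":{"managementState":"Managed","storage":{"pvc":{"claim":"image-registry-storage"}}}}'
}
```

Depending on the storage class, further registry settings may be needed; check the registry pods in `openshift-image-registry` after applying the patch.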
The actual visual inspection use case with "live" data is implemented through a simulated camera data source that streams metal nut images onto a Kafka topic. A serverless ML client consumes these images and sends them to an ML service managed by RHODS. A dashboard application visualizes the metal nut images and the defects that are detected by the deployed ML model.
The model can be deployed within a separate environment, in principle even on MicroShift (reflecting the likely customer scenario). During the hackathon we chose to reuse the SNO instance that we used for the ML training part.
Follow the instructions as per demo documentation. Additional comments below.
- Deploy Minio for S3 storage.
  - Open the Minio dashboard route and log in (`minio`/`minio123`). Create a bucket with name `models`.
- Deploy the trained model with RHODS model serving.
  - In the RHODS dashboard, navigate to project `hackathon-ds`.
  - Configure the model server.
  - Download the trained model. Choose `manu-vi-best-yolov5m.onnx` for the best performance.
  - Upload the model file into the `models` bucket through the Minio GUI.
  - Select `Deploy model` in the RHODS dashboard. Specify the data connection parameters:
    - name: models
    - access key: `minio`
    - secret key: `minio123`
    - S3 endpoint URL: `http://minio-service.minio.svc:9000`
    - bucket name: `models`
    - model format: `ONNX`
    - model path: `manu-vi-best-yolov5m.onnx`
  - Validate the model deployment through the inference notebook.
- Deploy Kafka.
  - Deploy the AMQ Streams operator version 2.3.x.
  - Deploy the Kafka cluster manifest.
  - Deploy the Kafka topic manifest.
- Deploy OpenShift Serverless.
  - Deploy the operator.
  - Instantiate `KnativeServing`, `KnativeEventing`, `KnativeKafka`.
    - In the `KnativeKafka` specs, set the following flags:
      - `spec.broker.enabled`: `true`
      - `spec.channel.enabled`: `true`
      - `spec.sink.enabled`: `true`
      - `spec.source.enabled`: `true`
- Deploy the ML application.
  - Deploy the build configs and imagestreams.
    - Fix the API versions of the imagestreams. They should be: `apiVersion: image.openshift.io/v1`
  - Deploy the `KafkaSource`.
    - Update the Kafka bootstrap server URL.
    - Ensure the `manuela-visual-inspection` namespace is used.
  - Deploy the `KafkaTriggers`.
  - Open the visual inspection dashboard to see real-time scratch and bend detection.
    - Open the dashboard through HTTP instead of HTTPS.
    - You may want to double-check the Network Policies to ensure the `image-processor` can send requests to the deployed model within the `hackathon-ds` namespace.
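The four `KnativeKafka` flags listed above could be set in a manifest along the following lines. This is a sketch, not the demo's manifest: `knative_kafka_manifest` is a hypothetical helper, the resource name `knative-kafka` and namespace `knative-eventing` are assumed defaults, and the bootstrap server address is a placeholder that has to match your Kafka cluster. Check the fields against the OpenShift Serverless version you installed.

```shell
# Sketch: print a KnativeKafka manifest with all four components enabled.
# $1 is a placeholder for the Kafka bootstrap server address (host:port).
knative_kafka_manifest() {
  bootstrap="$1"
  cat <<EOF
apiVersion: operator.serverless.openshift.io/v1alpha1
kind: KnativeKafka
metadata:
  name: knative-kafka
  namespace: knative-eventing
spec:
  broker:
    enabled: true
    defaultConfig:
      bootstrapServers: ${bootstrap}
  channel:
    enabled: true
    bootstrapServers: ${bootstrap}
  sink:
    enabled: true
  source:
    enabled: true
EOF
}
```

The output can be reviewed first and then applied with `knative_kafka_manifest "<bootstrap host>:<port>" | oc apply -f -`.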