ADR-0003: TrustyAI Service Deployment using Operator #5
cc: @anishasthana @Jooho @etirelli Your feedback would be most valuable, thank you!
adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
If such configuration is not provided, the operator will use the default configuration.
### Route
Copying a comment from the Google Doc -- we should think through the Service Mesh integration (if any) here too.
### Route

If deployed on OpenShift, the Operator will also create a `Route` object to expose the TrustyAI Service to external clients. The `Route` object will have the following configuration:
Does the service have auth enabled by default? If not, we will need to think about that too, since otherwise TrustyAI data would be exposed to the wider internet by default.
@anishasthana Good point, thanks! (No, it doesn't.)
adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
Nice design doc. One thing I like to recommend when designing a new operator is considering what events you will emit as well. Conditions, metrics, and logs are good for surfacing issues, but also consider adding events to your operator to help users understand what it is doing.
I recommend this article as a good primer on some of the differences.
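To make the suggestion concrete, here is a rough sketch of the kind of core `v1` `Event` an operator might record against its Custom Resource. Everything specific in it is an assumption for illustration only: the `TrustyAIService` kind, the object and namespace names, the `PVNotFound` reason, and the `trustyai-operator` component name are not taken from the ADR.

```yaml
# Hypothetical Event as it might appear in `kubectl get events -o yaml`.
apiVersion: v1
kind: Event
metadata:
  name: trustyai-service.pv-not-found   # hypothetical name
  namespace: my-namespace               # hypothetical namespace
type: Warning                           # core Event types are Normal or Warning
reason: PVNotFound                      # hypothetical, machine-readable reason
message: 'PersistentVolume "mypv" not found; storage cannot be bound'
involvedObject:
  kind: TrustyAIService                 # hypothetical CR kind
  name: trustyai-service                # hypothetical CR name
  namespace: my-namespace
source:
  component: trustyai-operator          # hypothetical component name
```

Unlike a status condition, an event like this shows up in `kubectl describe` output for the involved object, which is often the first place users look.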
## Proposal

We propose to use a stand-alone TrustyAI Kubernetes Operator which would create and manage the required Deployment, Service, ConfigMap, Route, and ServiceMonitor resources based on a simple Custom Resource while keeping the state consistent with the desired one [^1].
This sounds like a big win to me.
* `replicas` is an optional field that specifies the number of replicas of the TrustyAI service that you want to run. If not provided, the default is one replica.
* `storage` is a mandatory field that specifies the storage details. It has two nested fields:
  * `format` - the storage format (example: a Persistent Volume Claim (PVC)).
Is a specific PVC AccessMode needed (RWX, RWO)?
- Is it using the default StorageClass?
- If so, will a default StorageClass be a prerequisite?
- If there is no default StorageClass, what will happen?
@Jooho Good point, I'll add this info.
RWO is what the manifests have been specifying so far, so I think we could stick with that.
Regarding the StorageClass, the initial implementation disables dynamic provisioning and binds to already existing PVs.
Along the lines of:

    spec:
      storage:
        pv: "mypv"
        size: 1Gi
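For illustration, a claim that skips dynamic provisioning and binds to an existing PV could look roughly like the sketch below. The claim name is hypothetical, and whether the operator would generate exactly this object is an assumption. `mypv`, `1Gi`, and RWO come from the discussion above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trustyai-service-pvc   # hypothetical name
spec:
  storageClassName: ""         # empty string opts out of dynamic provisioning
  volumeName: mypv             # bind to this pre-existing PersistentVolume
  accessModes:
    - ReadWriteOnce            # RWO, as in the existing manifests
  resources:
    requests:
      storage: 1Gi
```

With `storageClassName: ""` no default StorageClass is needed, but the PV itself must exist, have no class set, and offer at least the requested capacity and access mode; otherwise the claim stays `Pending`.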
adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
Note that TrustyAI isn't currently implementing HTTPS endpoints, so the `tls` field will be set to `null` for now. Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration.
We can use edge TLS termination. Is there a reason to use re-encrypt or passthrough TLS?
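As a sketch of what edge termination could look like on the Route: TLS ends at the OpenShift router and traffic to the pod stays plain HTTP, so the service itself would not need an HTTPS endpoint. The Route name, Service name, and target port name below are assumptions, not values from the ADR.

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: trustyai-service         # hypothetical name
spec:
  to:
    kind: Service
    name: trustyai-service       # hypothetical Service name
  port:
    targetPort: http             # hypothetical port name on the Service
  tls:
    termination: edge            # router terminates TLS, forwards plain HTTP
    insecureEdgeTerminationPolicy: Redirect   # redirect HTTP clients to HTTPS
```

Re-encrypt and passthrough would both require the service to serve TLS itself, which the ADR says is not implemented yet.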
### ModelMesh Serving Integration

The operator also ensures the correct configuration of the ModelMesh Serving component. Once the TrustyAI Service is deployed and reachable, the operator will patch the ModelMesh Serving configuration to include a custom payload processor and it will be configured to point to the consumer endpoint of the deployed TrustyAI Service.
I wonder if this is really OK to do.
When installing ModelMesh via operator, this patch may be rolled back, no?
@anishasthana @danielezonca Related to the question of custom images, a new section was added on how to provide custom service images.
This is a proposal for the deployment of the TrustyAI service using an Operator (ADR-0003).
Some questions (open for discussion) are: