Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR-0003: TrustyAI Service Deployment using Operator #5

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

ruivieira
Copy link
Member

@ruivieira ruivieira commented May 22, 2023

This is a proposal for the deployment of the TrustyAI service using an Operator (ADR-0003).

Some questions (open for discussion) are:

  • Should the custom resource allow for custom TrustyAI images (e.g. for testing purposes), using tagged releases by default?
  • Should the operator check for ServiceMesh availability and use it if available?

@ruivieira ruivieira added ADR Architecture Decision Record ADR/under-discussion Architecture Decision Record under discussion labels May 22, 2023
@ruivieira ruivieira requested review from danielezonca and a team May 22, 2023 08:54
@ruivieira ruivieira self-assigned this May 22, 2023
@ruivieira ruivieira requested review from tteofili and RobGeada and removed request for a team May 22, 2023 08:54
@ruivieira ruivieira requested a review from taneem-ibrahim May 23, 2023 09:42
@ruivieira
Copy link
Member Author

ruivieira commented May 23, 2023

cc: @anishasthana @Jooho @etirelli Your feedback would be most valuable, thank you!
@cam-garrison @bartoszmajsak @InnocentK Your feedback, from a service mesh perspective, would also be most valuable, thank you!


If such configuration is not provided, the operator will use the default configuration.

### Route

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying a comment from the google doc -- we should think through the servicemesh integration (if any) here too.


### Route

If deployed on OpenShift, the Operator will also create a `Route` object to expose the TrustyAI Service to external clients. The `Route` object will have the following configuration:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the service have auth enabled by default? If not, we will need to think about that too since otherwise TrustyAI data would be available to the wider internet by default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anishasthana Good point, thanks! (no it doesn't)

Copy link

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice design doc. one thing i like to recommend when designing a new operator is considering what events you will emit as well. conditions, metrics, and logs are good to surface issues but also consider adding events to your operator to help understand what is doing.

i recommend this article as a good primer about some of the differences.


## Proposal

We propose to use a stand-alone TrustyAI Kubernetes Operator which would create and manage the required Deployment, Service, ConfigMap, Route, and ServiceMonitor resources based on a simple Custom Resource while keeping the state consistent with the desired one [^1].
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like a big win to me


* `replicas` is an optional field that specifies the number of replicas of the TrustyAI service that you want to run. If not provided, the default is one replica.
* `storage` is a mandatory field that specifies the storage details. It has two nested fields:
* `format` - the storage format, (example: a Persistent Volume Claim (PVC)).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific AccessMode PVC needed? (RWX,RWO)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is it using default StorageClass?
  • So default StorageClass will be a prerequisite?
  • If there is no default storageClass, what will happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jooho Good point, I'll add this info.
RWO is what the manifests have been specifying so far, so I think we could keep with that.
Regarding the StorageClass, the initial implementation disables dynamic provisioning and binds to already existing PVs.
Along the lines of

spec:
  storage:
    pv: "mypv"
    size: 1Gi

```


Note that TrustyAI isn't currently implementing HTTPS endpoints, so the `tls` field will be set to `null` for now. Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use Edge TLS. Is there a reason to use reencrypt or passthough TLS?


### ModelMesh Serving Integration

The operator also ensures the correct configuration of the ModelMesh Serving component. Once the TrustyAI Service is deployed and reachable, the operator will patch the ModelMesh Serving configuration to include a custom payload processor and it will be configured to point to the consumer endpoint of the deployed TrustyAI Service.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is really OK to do.
When installing ModelMesh via operator, this patch may be rolled back, no?

@ruivieira
Copy link
Member Author

@anishasthana @danielezonca Related to the question of custom images, a new section was added on how to provide custom service images.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADR/under-discussion Architecture Decision Record under discussion ADR Architecture Decision Record
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants