Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: Add watcher to controller and webhook #33

Open
wants to merge 48 commits into
base: dev
Choose a base branch
from

Conversation

ChristianZaccaria
Copy link

@ChristianZaccaria ChristianZaccaria commented May 30, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Added a watcher for RayCluster CRD then start controller and webhook for Job Framework.
  • I followed the similar approach as done in the CodeFlare-Operator.

Which issue(s) this PR fixes:

Jira: https://issues.redhat.com/browse/RHOAIENG-7887

Special notes for your reviewer:

In this PR, the controller and webhook for Job Framework won't start unless the raycluster CRD is available. Keep in mind, this PR is currently not watching for the PyTorchJob CRD, hence, will start the controller and webhook anyways. - Please verify if we are okay with this.

Does this PR introduce a user-facing change?


Change-Id: Iaad4e043f8c295edbe16fa08f8ad431d8d96d8e5

Co-authored-by: Aldo Culquicondor <[email protected]>
Change-Id: I325b633421e009cd3f7ae5b48ddcf0d3da46741d
Change-Id: I2b2aa88eca55d3385cb34cf141b5da3c902c28d7

Co-authored-by: Aldo Culquicondor <[email protected]>
* Don't deploy visibility by default

* Review Remarks

* Review Remarks

* Review Remarks

---------

Co-authored-by: Traian Schiau <[email protected]>
* Parallelize integration tests

* Create parameter INTEGRATION_NPROCS

* test INTEGRATION_NPROCS=2

* test INTEGRATION_NPROCS=8

* revert INTEGRATION_NPROCS

---------

Co-authored-by: Gabe <[email protected]>
Use a custom service account and role when generating
the multikueue kubeconfigs.

Co-authored-by: Traian Schiau <[email protected]>
…t. (kubernetes-sigs#1809)

* Cluster connection monitoring and reconnect.

* Review Remarks

---------

Co-authored-by: Traian Schiau <[email protected]>
Change-Id: I5e44372ee23b253c6eeb2a71c8a97dd0370e5072

Co-authored-by: Aldo Culquicondor <[email protected]>
…-sigs#1832)

* Avoid requeue on ClusterQueue status update

Change-Id: Id028414a730a238549b7fddf5d42a2eafdf959c3

* Reclaimable pods can put pod out of inadmissible

Change-Id: Id52ef3c92abd84317c4a7a71c3a20f6f28d67d25

* Delay requeueing based on RequeueState

Change-Id: I5431e741cb30541a846b65fb26aa0b7b631eb80d

---------

Co-authored-by: Aldo Culquicondor <[email protected]>
Change-Id: If8f430731a6967a361d0f5172ed408f57a0687bc

Co-authored-by: Aldo Culquicondor <[email protected]>
…sigs#1830) (kubernetes-sigs#1836)

* [e2e] Return ginkgo exit code

* [e2e] Restore manager's image in sh

* Cleanup and remarks

Change-Id: I66ccb5173149458dfcdd64d117dd81cd3ded9c5c

Co-authored-by: Traian Schiau <[email protected]>
…d=false (kubernetes-sigs#1845)

Change-Id: Ib10c1874bfb3f5810f83d751e4d80ca6eb031bfa
Change-Id: I7cc659b2b013cd4a0b907b0e3002b877dac2e4ae
)

Change-Id: I69584dd95c57539e163067a4fb93cdb32fc57461

Co-authored-by: Aldo Culquicondor <[email protected]>
…kubernetes-sigs#1880)

Change-Id: I68155a467880059c65deb75b00939b14551924cd

Co-authored-by: Aldo Culquicondor <[email protected]>
…s-sigs#1917)

* Dereference StopPolicy before equality check

* Update pkg/controller/core/workload_controller.go

Co-authored-by: Yuki Iwai <[email protected]>

* Update pkg/controller/core/workload_controller.go

Co-authored-by: Yuki Iwai <[email protected]>

---------

Co-authored-by: Gabe <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Change-Id: I7f17d51df66de6766af44e48f45b4165e8bd3e30

Co-authored-by: Aldo Culquicondor <[email protected]>
Change-Id: Icabeb0cca6f7680173d72be29a293731f93d8f9b
@openshift-ci openshift-ci bot requested review from dimakis and MichaelClifford May 30, 2024 15:15
Copy link

openshift-ci bot commented May 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ChristianZaccaria
Once this PR has been reviewed and has the lgtm label, please assign jbusche for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChristianZaccaria ChristianZaccaria force-pushed the watch-crds-raycluster-pytorchjob branch from 2736a06 to 5bec433 Compare May 30, 2024 15:30
Copy link

openshift-ci bot commented May 30, 2024

@ChristianZaccaria: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-odh-kueue 5bec433 link true /test e2e-odh-kueue

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ChristianZaccaria ChristianZaccaria changed the title Add watcher to controller and webhook DRAFT: Add watcher to controller and webhook Jun 5, 2024
crdClient, err := apiextensionsclientset.NewForConfig(mgr.GetConfig())
exitOnError(err, "unable to create CRD client")

crdList, err := crdClient.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the restmapping?

Comment on lines +47 to +48
pytorchjobAPI = "pytorchjobs.kubeflow.org"
rayclusterAPI = "rayclusters.ray.io"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't these be obtained from the gvk?

@openshift-merge-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

replicas: 1
template:
metadata:
annotations:
kubectl.kubernetes.io/default-container: manager
labels:
control-plane: controller-manager

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are usually a part of scaffolding, so I assume they would be added back again?

return fmt.Errorf("%s: %w", fwkNamePrefix, err)
}
logger.Info("No matching API in the server for job framework, skipped setup of controller and webhook")
if !isAPIAvailable(context.TODO(), mgr, rayclusterAPI) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - So if the API is not available, is it by requirement that we don't proceed with starting the controller, or do we want to start the controller with informers on existing GVRs any way?

Asking this because, if we want to do the latter: instead of setting up watches and waiting for APIs to become available by blocking the manager to start, we could use discovery client at the time of setup, check for API availability, and set up watches accordingly.

Setting up informers prior or later in the process depends on whether we want the CRDs to be installed as a requirement or make it optional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.