Use controller-runtime to consume Machine API #89
Conversation
Are we planning to migrate to controller-runtime?

Yes, @Bobbins228 and I are working on it.
Force-pushed from 8d6d112 to 2a8ebf5
@Bobbins228 Thanks for this PR. Did we do manual testing of this PR? Is scale-up and scale-down happening as expected?

I tested this on an OSD cluster and I was able to successfully scale up and down, but I did receive these warning logs which didn't appear before my changes.

Is that related to the new CRD stuff?

I am not sure, but I think I recall seeing the same errors in MCAD too.

No, it's an existing issue that'll be fixed with project-codeflare/multi-cluster-app-dispatcher#456.
controllers/appwrapper_controller.go
Outdated
if appwrapper.ObjectMeta.DeletionTimestamp.IsZero() {
    if !controllerutil.ContainsFinalizer(&appwrapper, finalizerName) {
        // onAdd replacement
        if appwrapper.Status.State == arbv1.AppWrapperStateEnqueued || appwrapper.Status.State == "" {
Here, the logic of scaling up is guarded by the finalizer, but there is still a possibility that the update to add the finalizer fails, which can lead to duplicated scaling. There is the `scaledAppwrapper` global variable that holds an in-memory state of the scaled AppWrappers, but it's also fragile, as it's not persisted, for example in case the controller restarts.
I'd suggest decoupling the finalizer logic from the scaling invariant guard, and implementing the latter by relying on the persisted state in the cluster, e.g., by checking the existence of a MachinePool / MachineSet with the corresponding AppWrapper label(s), added atomically during the creation of these resources. A corollary would be to have the `scaledAppwrapper` global variable removed altogether.
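For illustration, a minimal sketch of the suggested guard, relying on the persisted cluster state instead of the in-memory list. The helper name, the label key `instascale.codeflare.dev/appwrapper`, and the `openshift-machine-api` namespace are assumptions for this example, not the project's actual identifiers:

```go
package controllers

import (
	"context"

	machinev1 "github.com/openshift/api/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineSetExistsFor reports whether a MachineSet labelled with the given
// AppWrapper name already exists, i.e. whether scaling has already happened.
// The label would be added atomically when the MachineSet is created.
func machineSetExistsFor(ctx context.Context, c client.Client, awName string) (bool, error) {
	msList := &machinev1.MachineSetList{}
	if err := c.List(ctx, msList,
		client.InNamespace("openshift-machine-api"),
		client.MatchingLabels{"instascale.codeflare.dev/appwrapper": awName}, // assumed label key
	); err != nil {
		return false, err
	}
	return len(msList.Items) > 0, nil
}
```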
controllers/machineset.go
Outdated
@@ -109,11 +114,12 @@ func scaleMachineSet(aw *arbv1.AppWrapper, userRequestedInstanceType string, rep
    //TODO: user can delete appwrapper work on triggering scale-down
    klog.Infof("waiting for machines to be in state Ready. replicas needed: %v and replicas available: %v", replicas, ms.Status.AvailableReplicas)
    time.Sleep(1 * time.Minute)
The reconcile loop should not be blocked. Instead, the reconcile request should be re-queued so that the state of the MachineSet is checked again later.
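A minimal sketch of what that could look like; the helper name and the `int32` replica count are assumptions, and in the real controller the `Result` would be returned from `Reconcile`:

```go
package controllers

import (
	"time"

	machinev1 "github.com/openshift/api/machine/v1beta1"
	"k8s.io/klog/v2"
	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueUntilReady illustrates the re-queue pattern: instead of sleeping
// inside the reconcile loop, it returns a Result asking controller-runtime to
// call Reconcile again after a delay, at which point the MachineSet status is
// checked anew.
func requeueUntilReady(ms *machinev1.MachineSet, replicas int32) (ctrl.Result, error) {
	if ms.Status.AvailableReplicas < replicas {
		klog.Infof("waiting for machines to be in state Ready. replicas needed: %v and replicas available: %v",
			replicas, ms.Status.AvailableReplicas)
		// Re-queue the request rather than blocking the worker goroutine.
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	}
	return ctrl.Result{}, nil
}
```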
/retest
controllers/appwrapper_controller.go
Outdated
// Only reason we are calling it here is that the client is not able to make
// calls until it is started, so SetupWithManager is not working.
if !useMachineSets && ocmClusterID == "" {
    getOCMClusterID(r)
I'd remove the call here, get the cluster ID lazily by calling the function where it's needed, and gate the retrieval of the ID with `sync.Once`.
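A rough sketch of that gating pattern, with placeholder names; `fetch` stands in for whatever actually looks up the cluster ID, and note that a failed first fetch is cached along with its error:

```go
package controllers

import "sync"

var (
	ocmClusterIDOnce sync.Once
	ocmClusterID     string
	ocmClusterIDErr  error
)

// lazyOCMClusterID fetches the cluster ID exactly once, on first use, no
// matter which caller needs it first.
func lazyOCMClusterID(fetch func() (string, error)) (string, error) {
	ocmClusterIDOnce.Do(func() {
		ocmClusterID, ocmClusterIDErr = fetch()
	})
	return ocmClusterID, ocmClusterIDErr
}
```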
I tried to implement this by calling it within the `machinePoolExists()` function in `SetupWithManager()`, but this didn't work because the client can't be used until it is started. I do notice that we check if the OCM secret exists before we call `machinePoolExists()`. As I understand it, we only need that OCM secret if we are scaling machine pools. If this is the case, I can use that as the condition that sets `useMachinePools` to `true` and apply your suggestion in the other machinePool functions.
So the call from the `SetupWithManager` method fails because `getOCMClusterID` uses the controller-runtime cached client to query the ClusterVersion API. This is not recommended, as it'll keep watching the API for no reason, and should be changed to using the plain client-go / OpenShift client. Then it'll be possible to move the call to `getOCMClusterID` into `SetupWithManager`.
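As a sketch of that direction, the ClusterVersion could be read through the plain OpenShift config clientset rather than the manager's cached client; the function name and the `rest.Config` plumbing are assumptions for illustration:

```go
package controllers

import (
	"context"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// fetchClusterID reads the cluster ID from the ClusterVersion object using the
// OpenShift config clientset, so no informer or watch is created the way it
// would be with the controller-runtime cached client.
func fetchClusterID(cfg *rest.Config) (string, error) {
	cs, err := configclient.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	// The cluster-scoped ClusterVersion resource is always named "version".
	cv, err := cs.ConfigV1().ClusterVersions().Get(context.Background(), "version", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return string(cv.Spec.ClusterID), nil
}
```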
Force-pushed from 0b94968 to e310af4
Force-pushed from e310af4 to b03cad3
Force-pushed from 854ca8d to 8743b4e
Force-pushed from 8743b4e to 450afa1
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: astefanutti.
Closes: #53
Verification Steps
- OSD Cluster - Machine Pools: InstaScale should work as expected
- Self Managed / OCP Cluster - MachineSets: InstaScale should work as expected

Note: at the moment, when scaling up and down, errors about updating the AppWrapper will appear in the console due to an issue with applying/removing the finalizer.