Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Update rollout lib #276

Merged

Conversation

serngawy
Copy link
Member

Summary

Related issue(s)

Fixes #

@openshift-ci openshift-ci bot requested review from mdelder and qiujian16 September 11, 2023 21:39
@serngawy
Copy link
Member Author

/assign @haoqing0110

@serngawy
Copy link
Member Author

Adding
@dhaiducek

if needToRollout {
rolloutClusters[cluster] = status
rolloutClusters = append(rolloutClusters, status)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check the rolloutClusters length I think, should not put all the needToRollout existing clusters into the result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This iterate on the existing clusterRolloutStatus, Assuming it shouldn't never reach the length. okay will add it for safety.

Copy link
Member

@haoqing0110 haoqing0110 Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For addon, there's the possibility that 3 managedclusteraddons are created(existing) and status is "ToApply", if don't check the length, all of the 3 will be returned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the existingClusterStatus hold only clusterStatus; progressing, succeeded, failed and timeout cause the clusterStatus ToApply means the cluster workload does not exist yet. So as addon controller you should not add clusterStatus ToApply to the existingClusterStatus cause the addon (workload) does not exist. What do you think ?
I will add check to ignore the ToApply clusterStatus at the existingClusterStatus loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the check for the ToApply status, so if the existingClusterStatus has all clusters in ToApply status, the RolloutResult will contain only the required maxConcurrency or the required decisionGroup. Added unitTest as well for this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ToApply means the desired status not applied yet, no matter it's a fresh install (no current workload) or upgrade case (current workload does not match the desired workload) .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I added the check for the ToApply state as well.


clusterGroupKeys := clusterGroupsMap.GetOrderedGroupKeys()
for _, status := range existingClusterStatus {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is to rollout by group, why go through the existingClusterStatus and put it to result? what will the returned rolloutClusters be like?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it return the clusters that are still progressing, failing (before the timeout) and new clusters add to the group after the rollout starts

Copy link
Member

@haoqing0110 haoqing0110 Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have some concerns about going through the existingClusterStatus first and then going through the rest of the clusters, for both progressivePerCluster and progressivePerGroup.

For example, the env has 100 clusters with status Succeed, then rollout starts, what's the expected cluster status returned by existingClusterStatus? For addon existingClusterStatus will return ToApply when rollout starts, in this case, all the clusters will be returned by GetRolloutCluster(), which doesn't make sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point, same as above let me know your thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another gap after this change is that, in addon, one cma will have multiple placement, there's placement overlap. For example:
The config-v1 existingClusterStatus has cluster1 and config-v2 existingClusterStatus has cluster2, cluster3. For config-v1, cluster2 is not in existing cluster and it should also not be treated as added cluster and return in rollout result.

  installStrategy:
    type: Placements
    placements:
    - name: placement1 // placement1 select cluster1 and cluster2
      namespace: default
      configs:
      - group: addon.open-cluster-management.io
        resource: addonhubconfigs
        name: config-v1
      rolloutStrategy:
        type: Progressive
    - name: placement2 // placement2 select cluster2 and cluster3
      namespace: default
      configs:
      - group: addon.open-cluster-management.io
        resource: addonhubconfigs
        name: config-v2
      rolloutStrategy:
        type: Progressive

I'm thinking should we just rollout the clusters returned by ClusterRolloutStatusFunc and put added/deleted cluster in another field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removed (deleted) clusters are already in another field ClustersRemoved .
Regards the added clusters, not sure if I understand the example above correctly. Config-v1 is associated with placement1 and config-v2 is associated with placement2; those are 2 different configs with 2 different placements. However; as cluster-2 is associated with config-v1 and config-v2, cluster-2 must apply config-v2 as it has higher version. Is that correct ?
If yes, I guess this is more addon logic that is need to be handled inside the addon controller. what do you think ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addon has returned a SKIP status for cluster2 when rollout config-v1 to indicate not to rollout it. Not sure if PR can handle the case when cluster2 is not in existingClusterStatus and user want to SKIP rollout it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the idea for existingClusterStatus; if the the existing cluster has Skip status it will not be considered in the rolloutResult->ClustersToRollout. The addon set the cluster2 ClusterRolloutStatus to skip (for configv1 and placement1) then the rolloutResult->ClustersToRollout will not have it. I added comments here as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, addon might need to put SKIP cluster in existingClusterStatus so that won't put it in ClustersToRollout. Sounds good.

@serngawy serngawy force-pushed the rolloutLib branch 3 times, most recently from b6551e1 to c92b89c Compare September 13, 2023 15:13
Copy link
Member

@dhaiducek dhaiducek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I've left some minor comments.

(Regarding the current discussion about removed clusters: In general, I like having the removed clusters available to the controller so that they can be handled right away, but I'm still processing the existing discussion about it.)

}

// ClusterRolloutStatusFunc defines a function to return the rollout status for a managed cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ClusterRolloutStatusFunc defines a function to return the rollout status for a managed cluster.
// ClusterRolloutStatusFunc defines a function to return the rollout status for a managed cluster for a given workload.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines -58 to +62
// ClustersToRollout is a map where the key is the cluster name and the value is the ClusterRolloutStatus.
ClustersToRollout map[string]ClusterRolloutStatus
// ClustersTimeOut is a map where the key is the cluster name and the value is the ClusterRolloutStatus.
ClustersTimeOut map[string]ClusterRolloutStatus
// ClustersToRollout is a slice of ClusterRolloutStatus that will be rolled out.
ClustersToRollout []ClusterRolloutStatus
// ClustersTimeOut is a slice of ClusterRolloutStatus that are timeout.
ClustersTimeOut []ClusterRolloutStatus
// ClustersRemoved is a slice of ClusterRolloutStatus that are removed.
ClustersRemoved []ClusterRolloutStatus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is an array now so that the mapping to each cluster's name is clearer to the end user? I kind of liked the map[string]ClusterRolloutStatus for uniqueness and simplicity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand its easier to iterate with a map but with the changes that happen we don't really need the clusterName to be in a map. Plus its better to move the clusterName to the ClusterRolloutStatus struct to be identifiable object by itself.

// Calculate the length for progressive rollOut
// If the MaxConcurrency not defined, total clusters length is considered as maxConcurrency.
clusterGroups = r.pdTracker.ExistingClusterGroupsBesides(groupKeys...)
length, err := calculateLength(strategy.Progressive.MaxConcurrency, len(clusterGroups.GetClusters()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: This is outside of this PR's intent, but could we consider renaming calculateLength() to something more descriptive like calculateRolloutSize()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for _, cluster := range clusters {
if clusterStatus.ClusterName == cluster {
exist = true
currentClusterStatus = append(currentClusterStatus, clusterStatus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a break here so that the loop doesn't continue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, done

}

existingClusters[status.ClusterName] = true
if status.Status == Succeeded || status.Status == TimeOut {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: This could be a switch statement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, done.

Comment on lines 278 to 281
if len(rolloutClusters) >= length {
return RolloutResult{
ClustersToRollout: rolloutClusters,
ClustersTimeOut: timeoutClusters,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this check redundant since it's already in the loop above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, removed.

timeoutClusters := []ClusterRolloutStatus{}
existingClusters := make(map[string]bool)

for _, status := range existingClusterStatus {
Copy link
Member

@haoqing0110 haoqing0110 Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should here sort the existingClusterStatus by name by alphabetical order to ensure every time returns the same result?

// Set as false to consider the cluster in the decisionGroups iteration.
existingClusters[status.ClusterName] = false
case Failed, Progressing:
newStatus, needToRollout := determineRolloutStatusAndContinue(status, timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can here combine the switch case and determineRolloutStatusAndContinue's switch case into one function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, done.

// +k8s:deepcopy-gen=false
type RolloutHandler struct {
type RolloutHandler[T any] struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be runtime.Object rather than any?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime.Object will required casting for the used Type and We want the RolloutHandler to be initiated with a certain type (eg; manifestwork, policy, addon, ...etc) not a runtime.object

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but any is a more loose constraint, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes "any" is kind of loose, but we are safe. The main idea behind make a generic Type "any" is to avoid doing casting (or force certain type) at the clusterRolloutStatusFunc Function implementation similar to the unit test example here AND the helper lib is safe we don't call the clusterRolloutStatusFunc any where, its just a func definition that will be used to create the existing ClustersRollOutStatus at the consumer API side. What do you think ? Is that make sense ?

Copy link
Member

@xuezhaojun xuezhaojun Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run make update will fail on deepcopy functions generating step:

Generating deepcopy funcs
W1017 14:10:32.195161   62703 parse.go:863] Making unsupported type entry "T" for: &types.TypeParam{check:(*types.Checker)(nil), id:0x1, obj:(*types.TypeName)(0xc009651900), index:0, bound:(*types.Interface)(0xc00009a960)}
F1017 14:10:32.221298   62703 deepcopy.go:890] Hit an unsupported type open-cluster-management.io/api/cluster/v1alpha1.ClusterRolloutStatusFunc[T] for open-cluster-management.io/api/cluster/v1alpha1.ClusterRolloutStatusFunc[T], from open-cluster-management.io/api/cluster/v1alpha1.RolloutHandler[T]

Seems deepcopy lib doesn't support generic type yet. kubernetes/gengo#225

Either we revert the normal struct or we disable package-level generation and explicitly add the comment to each struct we need do deepcopy function generating: #288

cc @haoqing0110

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @dhaiducek , I test controller-gen with latest api main branch, it fails and generates unexpected code for generic type:

./_output/tools/bin/controller-gen object:headerFile="hack/empty.txt" paths="./cluster/v1alpha1"
open-cluster-management.io/api/cluster/v1alpha1:-: invalid type: func(clusterName string, workload T) (open-cluster-management.io/api/cluster/v1alpha1.ClusterRolloutStatus, error)
Error: not all generators ran successfully
run `controller-gen object:headerFile=hack/empty.txt paths=./cluster/v1alpha1 -w` to see all available markers, or `controller-gen object:headerFile=hack/empty.txt paths=./cluster/v1alpha1 -h` for usage

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned above, we can also change to the explicit way to claim which struct needs gen deepcopy funcs: #288

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems the gengo tooling may not be respecting the deepcopy-gen=false tag? In any case, I don't have a great deal of familiarity with deepcopy, but here's a commit that might work as an alternative to selectively generating the APIs:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xuezhaojun I'm able to reproduce it as well. having the same issue as @dhaiducek sounds like deepCopy generator does not respect the deepcopy-gen=false tag !
Based on the deepCopy code doc here setting the deepcopy-gen=false should work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing the alternative! I will follow with it.

@@ -83,20 +88,21 @@ func NewRolloutHandler(pdTracker *clusterv1beta1.PlacementDecisionClustersTracke
//
// ClustersTimeOut: If the cluster status is Progressing or Failed, and the status lasts longer than timeout defined in strategy,
// will list them RolloutResult.ClustersTimeOut with status TimeOut.
func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutResult, error) {
// func (r *RolloutHandler) GetRolloutCluster(rolloutStrategy RolloutStrategy, statusFunc ClusterRolloutStatusFunc) (*RolloutStrategy, RolloutResult, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this line

}

func progressivePerCluster(clusterGroupsMap clusterv1beta1.ClusterGroupsMap, length int, timeout time.Duration, existingClusterStatus []ClusterRolloutStatus) RolloutResult {
rolloutClusters := []ClusterRolloutStatus{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use var rolloutClusters, timeoutClusters []ClusterRolloutStatus

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, done

Signed-off-by: melserngawy <[email protected]>
@qiujian16
Copy link
Member

/approve
/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qiujian16, serngawy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit bf4f47e into open-cluster-management-io:main Sep 25, 2023
9 checks passed
haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023
haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023
haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023
haoqing0110 pushed a commit to haoqing0110/api that referenced this pull request Nov 24, 2023
Signed-off-by: melserngawy <[email protected]>

Fix: run `make update` fail because deepcopy doesn't support generic type. (open-cluster-management-io#288)

Signed-off-by: xuezhaojun <[email protected]>

Fix verify ci step missing. (open-cluster-management-io#289)

Signed-off-by: xuezhaojun <[email protected]>

:bug: Use controller-runtime for deepcopy generation for cluster:v1alpha1 (open-cluster-management-io#291)

* Revert "Fix: run `make update` fail because deepcopy doesn't support generic type. (open-cluster-management-io#288)"

This reverts commit ae208c8.

Signed-off-by: Dale Haiducek <[email protected]>

* Use `controller-gen` for deepcopy cluster:v1alpha1

GenGo isn't respecting the `+k8s:deepcopy-gen=false` tag to skip
generation for the generic type

Signed-off-by: Dale Haiducek <[email protected]>

---------

Signed-off-by: Dale Haiducek <[email protected]>

🐛 add ca bundle to addon proxy settings (open-cluster-management-io#293)

Signed-off-by: Yang Le <[email protected]>

Revert "remove ClusterSet ClusterSetBinding API version v1beta1 (open-cluster-management-io#266)"

This reverts commit 9675ab7.

Signed-off-by: haoqing0110 <[email protected]>
openshift-merge-bot bot pushed a commit that referenced this pull request Nov 24, 2023
Fix: run `make update` fail because deepcopy doesn't support generic type. (#288)



Fix verify ci step missing. (#289)



:bug: Use controller-runtime for deepcopy generation for cluster:v1alpha1 (#291)

* Revert "Fix: run `make update` fail because deepcopy doesn't support generic type. (#288)"

This reverts commit ae208c8.



* Use `controller-gen` for deepcopy cluster:v1alpha1

GenGo isn't respecting the `+k8s:deepcopy-gen=false` tag to skip
generation for the generic type



---------



🐛 add ca bundle to addon proxy settings (#293)



Revert "remove ClusterSet ClusterSetBinding API version v1beta1 (#266)"

This reverts commit 9675ab7.

Signed-off-by: haoqing0110 <[email protected]>
Co-authored-by: Mohamed ElSerngawy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants