Activate Early backoff functionality #253

@himanshu-kun himanshu-kun commented Sep 27, 2023

What this PR does / why we need it:
Activates early backoff for the mcm cloud provider.

Which issue(s) this PR fixes:
Fixes #154

Special notes for your reviewer:

CORNER_CASE:

There is still a corner case which can block scale-up for a while. If a node-grp is scaled up and the same node-grp cannot be scaled down afterwards (e.g. scale-down is blocked by a rolling update), then the set of pods which triggered the scale-up for the node-grp will not be considered `unschedulable` by the autoscaler for `max-node-provision-time` (assuming the VM doesn't join in that time).
This is because the autoscaler still considers the node in the node-grp (which we know won't join due to `ResourceExhausted`) as `Upcoming`. This can also be justified, because `ResourceExhausted` is a recoverable error, so the node can still join.
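
To make the corner case concrete, here is a minimal sketch of the "upcoming" check described above. The function name `stillUpcoming` and the 20-minute value are hypothetical and chosen only for illustration; this is not the autoscaler's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// stillUpcoming reports whether a requested node that has not joined yet is
// still counted as "upcoming": the time since it was requested is below
// maxNodeProvisionTime. (Hypothetical helper, not the autoscaler's API.)
func stillUpcoming(sinceRequest, maxNodeProvisionTime time.Duration) bool {
	return sinceRequest < maxNodeProvisionTime
}

func main() {
	// 20 minutes is an arbitrary value used for illustration only.
	const maxNodeProvisionTime = 20 * time.Minute

	for _, elapsed := range []time.Duration{5 * time.Minute, 25 * time.Minute} {
		upcoming := stillUpcoming(elapsed, maxNodeProvisionTime)
		// While the node counts as upcoming, the pods that triggered the
		// scale-up are expected to fit on it, so they are not treated as
		// unschedulable and no new scale-up is triggered for them.
		fmt.Printf("after %v: upcoming=%t, pods unschedulable=%t\n",
			elapsed, upcoming, !upcoming)
	}
}
```

In the corner case, the machine cannot be removed early, so the pods stay in the first state until the full provisioning timeout has passed.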

Docs:
For now the docs are added to FAQ.md. They will be moved to another folder when the CA docs are refactored overall.

Test results:

1. Early backoff from a single nodegrp working as expected

Manual test case 1: the out-of-quota nodegrp is scaled up first
I0927 12:32:38.650323   94730 klogx.go:87] Pod default/scale-up-pod-75b94d88b5-rdrfb is unschedulable
I0927 12:32:38.650333   94730 orchestrator.go:109] Upcoming 0 nodes
I0927 12:32:38.650759   94730 waste.go:55] Expanding Node Group shoot--i544024--early-bckf-worker-no-avail-z1 would waste 55.00% CPU, 99.39% Memory, 77.19% Blended
I0927 12:32:38.650773   94730 waste.go:55] Expanding Node Group shoot--i544024--early-bckf-worker-avail-z1 would waste 77.50% CPU, 99.69% Memory, 88.59% Blended
I0927 12:32:38.650783   94730 orchestrator.go:194] Best option to resize: shoot--i544024--early-bckf-worker-no-avail-z1
I0927 12:32:38.650791   94730 orchestrator.go:198] Estimated 1 nodes needed in shoot--i544024--early-bckf-worker-no-avail-z1
I0927 12:32:38.650820   94730 orchestrator.go:311] Final scale-up plan: [{shoot--i544024--early-bckf-worker-no-avail-z1 0->1 (max: 10)}]
I0927 12:32:38.650832   94730 orchestrator.go:583] Scale-up: setting group shoot--i544024--early-bckf-worker-no-avail-z1 size to 1

CA senses that the node won't come up due to `ResourceExhausted`, so it marks the nodegrp as backed off and removes the scaled-up machine

I0927 12:32:59.639238   94730 clusterstate.go:1059] Found 1 instances with errorCode OutOfResource.ResourceExhausted in nodeGroup shoot--i544024--early-bckf-worker-no-avail-z1
I0927 12:32:59.639254   94730 clusterstate.go:1077] Failed adding 1 nodes (1 unseen previously) to group shoot--i544024--early-bckf-worker-no-avail-z1 due to OutOfResource.ResourceExhausted; errorMessages=[]string{"Create machine \"shoot--i544024--early-bckf-worker-no-avail-z1-6485c-vg2qn\" failed: googleapi: Error 400: Invalid value for field 'resource.machineType': 'zones/asia-northeast1-b/machineTypes/g2-standard-4'. Machine type with name 'g2-standard-4' does not exist in zone 'asia-northeast1-b'., invalid"}
W0927 12:32:59.639304   94730 clusterstate.go:287] Disabling scale-up for node group shoot--i544024--early-bckf-worker-no-avail-z1 until 2023-09-27 12:37:59.638262 +0530 IST m=+932.738328537; errorClass=OutOfResource; errorCode=ResourceExhausted
I0927 12:32:59.639328   94730 static_autoscaler.go:405] 1 unregistered nodes present
I0927 12:32:59.639346   94730 static_autoscaler.go:806] Deleting 1 from shoot--i544024--early-bckf-worker-no-avail-z1 node group because of create errors
I0927 12:32:59.639344   94730 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"ed027710-58d0-46b9-9e59-66ccffa771fc", APIVersion:"v1", ResourceVersion:"79781", FieldPath:""}): type: 'Warning' reason: 'ScaleUpFailed' Failed adding 1 nodes to group shoot--i544024--early-bckf-worker-no-avail-z1 due to OutOfResource.ResourceExhausted; source errors: Create machine "shoot--i544024--early-bckf-worker-no-avail-z1-6485c-vg2qn" failed: googleapi: Error 400: Invalid value for field 'resource.machineType': 'zones/asia-northeast1-b/machineTypes/g2-standard-4'. Machine type with name 'g2-standard-4' does not exist in zone 'asia-northeast1-b'., invalid
I0927 12:32:59.856571   94730 mcm_manager.go:525] Machine shoot--i544024--early-bckf-worker-no-avail-z1-6485c-vg2qn of machineDeployment shoot--i544024--early-bckf-worker-no-avail-z1 marked with priority 1 successfully
I0927 12:32:59.856593   94730 mcm_manager.go:527] Expected to remove following {machineRef: corresponding node} pairs map[shoot--i544024--early-bckf-worker-no-avail-z1-6485c-vg2qn:]
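
The backoff window in the log above (disabled at 12:32:59 until 12:37:59) can be sketched roughly as follows. `backoffRegistry`, `at` and the fixed five-minute `backoffDuration` are hypothetical names for illustration; the five minutes is only read off the log, and the real autoscaler may grow the window on repeated failures.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration is assumed from the log: 12:32:59 -> 12:37:59.
const backoffDuration = 5 * time.Minute

// backoffRegistry is a hypothetical, simplified stand-in for the
// per-node-group backoff bookkeeping.
type backoffRegistry struct {
	until map[string]time.Time
}

func newBackoffRegistry() *backoffRegistry {
	return &backoffRegistry{until: make(map[string]time.Time)}
}

// backoff disables scale-up for the group and returns the time at which
// scale-up is allowed again.
func (r *backoffRegistry) backoff(group string, now time.Time) time.Time {
	end := now.Add(backoffDuration)
	r.until[group] = end
	return end
}

// isBackedOff reports whether scale-up is still disabled for the group.
// Groups that were never backed off have a zero end time and return false.
func (r *backoffRegistry) isBackedOff(group string, now time.Time) bool {
	return now.Before(r.until[group])
}

// at builds a timestamp on the day of the test run; the time zone does not
// matter for this sketch, so UTC is used.
func at(h, m, s int) time.Time {
	return time.Date(2023, time.September, 27, h, m, s, 0, time.UTC)
}

func main() {
	reg := newBackoffRegistry()
	group := "shoot--i544024--early-bckf-worker-no-avail-z1"

	end := reg.backoff(group, at(12, 32, 59))
	fmt.Println("scale-up disabled until", end.Format("15:04:05"))
	fmt.Println("backed off at 12:33:10:", reg.isBackedOff(group, at(12, 33, 10)))
	fmt.Println("backed off at 12:38:00:", reg.isBackedOff(group, at(12, 38, 0)))
}
```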

CA tries another zone

W0927 12:33:10.465864   94730 orchestrator.go:510] Node group shoot--i544024--early-bckf-worker-no-avail-z1 is not ready for scaleup - backoff
I0927 12:33:10.466135   94730 waste.go:55] Expanding Node Group shoot--i544024--early-bckf-worker-avail-z1 would waste 77.50% CPU, 99.69% Memory, 88.59% Blended
I0927 12:33:10.466151   94730 orchestrator.go:194] Best option to resize: shoot--i544024--early-bckf-worker-avail-z1
I0927 12:33:10.466159   94730 orchestrator.go:198] Estimated 1 nodes needed in shoot--i544024--early-bckf-worker-avail-z1
I0927 12:33:10.466185   94730 orchestrator.go:311] Final scale-up plan: [{shoot--i544024--early-bckf-worker-avail-z1 3->4 (max: 5)}]
I0927 12:33:10.466193   94730 orchestrator.go:583] Scale-up: setting group shoot--i544024--early-bckf-worker-avail-z1 size to 4
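
The switch from `no-avail-z1` to `avail-z1` seen in the two log excerpts can be sketched as below. The `option` type and `bestOption` function are hypothetical; the sketch assumes a lowest-waste choice, as the `waste.go` lines suggest, and uses the blended waste values from the logs.

```go
package main

import "fmt"

// option is a hypothetical scale-up candidate: a node group, the blended
// waste reported for expanding it, and whether it is currently in backoff.
type option struct {
	group     string
	wastePct  float64
	backedOff bool
}

// bestOption returns the candidate with the lowest waste among the groups
// that are not backed off. The boolean is false when no group is eligible.
func bestOption(opts []option) (option, bool) {
	var best option
	found := false
	for _, o := range opts {
		if o.backedOff {
			continue // "not ready for scaleup - backoff"
		}
		if !found || o.wastePct < best.wastePct {
			best, found = o, true
		}
	}
	return best, found
}

func main() {
	opts := []option{
		{group: "worker-no-avail-z1", wastePct: 77.19},
		{group: "worker-avail-z1", wastePct: 88.59},
	}

	first, _ := bestOption(opts)
	fmt.Println("first attempt:", first.group)

	// After the ResourceExhausted failure the first group is backed off.
	opts[0].backedOff = true
	second, _ := bestOption(opts)
	fmt.Println("after backoff:", second.group)
}
```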

First scale-up: 12:32:38
Next scale-up after learning: 12:33:10 (about 32 seconds later!)

2. Early backoff from multiple nodegrps in a row working as expected

Manual test case 2: trying scale-up in `no-avail`
I0927 13:38:12.451470    3487 orchestrator.go:311] Final scale-up plan: [{shoot--i544024--early-bckf-worker-no-avail-z1 0->1 (max: 10)}]

Backoff on failure

W0927 13:38:33.478725    3487 clusterstate.go:287] Disabling scale-up for node group shoot--i544024--early-bckf-worker-no-avail-z1 until 2023-09-27 13:43:33.477067 +0530 IST m=+336.462617495; errorClass=OutOfResource; errorCode=ResourceExhausted

Trying scale-up in `noavail2`

I0927 13:38:44.265628    3487 orchestrator.go:311] Final scale-up plan: [{shoot--i544024--early-bckf-worker-noavail2-z1 0->1 (max: 10)}]

Backoff on failure

W0927 13:38:54.852199    3487 clusterstate.go:287] Disabling scale-up for node group shoot--i544024--early-bckf-worker-noavail2-z1 until 2023-09-27 13:43:54.84848 +0530 IST m=+357.833835085; errorClass=OutOfResource; errorCode=ResourceExhausted

Finally scaling up avail-z1

I0927 13:39:05.700509    3487 orchestrator.go:311] Final scale-up plan: [{shoot--i544024--early-bckf-worker-avail-z1 4->5 (max: 5)}]
3. Early backoff doesn't happen for an invalid credentials error, as it is an `Internal` error
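
The distinction in the last test case can be sketched as a small classification step. Only the `ResourceExhausted` code, the `OutOfResource` class and the `Internal` error come from this PR; the `errorClass` type, the `Other` class and both function names are hypothetical and used for illustration only.

```go
package main

import "fmt"

// errorClass is a hypothetical label for how a machine creation error is
// treated by the autoscaler.
type errorClass string

const (
	outOfResource errorClass = "OutOfResource"
	otherError    errorClass = "Other" // assumed name for everything else
)

// classify maps a machine creation error code to an error class. Only the
// "ResourceExhausted" mapping is taken from the logs in this PR.
func classify(code string) errorClass {
	if code == "ResourceExhausted" {
		return outOfResource
	}
	return otherError
}

// triggersEarlyBackoff reports whether the error should back off the node
// group right away instead of waiting for the provisioning timeout.
func triggersEarlyBackoff(code string) bool {
	return classify(code) == outOfResource
}

func main() {
	for _, code := range []string{"ResourceExhausted", "Internal"} {
		fmt.Printf("%s -> class=%s earlyBackoff=%t\n",
			code, classify(code), triggersEarlyBackoff(code))
	}
}
```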

Release note:

Gardener autoscaler now backs off early from a node-group (i.e. machinedeployment) in case of a `ResourceExhausted` error. Refer to the docs at `https://github.com/gardener/autoscaler/blob/machine-controller-manager-provider/cluster-autoscaler/FAQ.md#when-does-autoscaler-backs-off-early-from-a-node-group` for details.

@himanshu-kun

/assign @rishabh-11

@rishabh-11 rishabh-11 left a comment

Thanks for the PR.
Some minor review comments, the rest looks good.

Resolved (outdated) review threads:
cluster-autoscaler/cloudprovider/mcm/mcm_manager.go (4 threads)
cluster-autoscaler/FAQ.md (3 threads)
@rishabh-11

/lgtm

@unmarshall unmarshall left a comment

/lgtm

@rishabh-11 rishabh-11 merged commit da973f4 into gardener:machine-controller-manager-provider Oct 3, 2023
4 checks passed
@himanshu-kun himanshu-kun deleted the early-backoff branch October 11, 2023 08:14
Successfully merging this pull request may close these issues.

Early abort/backoff support for Gardener nodegroups a.k.a machinedeployments