CAPI is taking too long removing taint node.cluster.x-k8s.io/uninitialized:NoSchedule from nodes #9858
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In addition to this taking time, it also creates a requirement for the MC to be able to connect to the WC when new nodes are added. This means that if you're using an externally managed cluster-autoscaler (as in, running cluster-autoscaler in the workload cluster) and the MC is unavailable for any reason (network issue, upgrade, maintenance, etc.), new nodes added by cluster-autoscaler will not become ready until the MC can connect again. This breaks one of the main reasons for running cluster-autoscaler in the WC in the first place: not having to rely on the MC for scaling when needed.
Question: does it only take too long for MachinePool-based Machines that were created by external scaling? Some datapoints around the removal of the taint: it gets done at the
which gets called from
and before that it waits, e.g., for a ProviderID to be set on the Machine as well as on the node inside the workload cluster:
It would be good to first know why it takes so long. Could it be that, when scaled externally, it takes too long for CAPI to reconcile and finish the node? I think we watch the node objects in a cluster 🤔 Do you have datapoints on whether other work for the node finished? E.g. did a provider ID get set by the cloud provider running inside this cluster (as noted above)?
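To make that ordering concrete, here is a minimal sketch of the gating just described (not the actual CAPI code, just a simplified model): the taint can only be removed once a provider ID exists on both the Machine and the workload-cluster Node, so anything that delays the provider ID also delays taint removal.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// uninitializedTaint mirrors the taint discussed in this issue; it is defined
// locally to keep the sketch self-contained.
var uninitializedTaint = corev1.Taint{
	Key:    "node.cluster.x-k8s.io/uninitialized",
	Effect: corev1.TaintEffectNoSchedule,
}

// reconcileNodeTaint is a simplified model of the ordering described above:
// nothing happens until a provider ID is set on both the Machine and the
// workload-cluster Node; only then are labels synced and the taint removed.
// It returns true if the taint was removed.
func reconcileNodeTaint(machineProviderID string, node *corev1.Node, desiredLabels map[string]string) bool {
	if machineProviderID == "" || node.Spec.ProviderID == "" {
		// Prerequisite not met: the cloud provider in the workload cluster
		// has not populated the provider ID yet, so the taint stays.
		return false
	}
	if machineProviderID != node.Spec.ProviderID {
		// This Node does not back this Machine.
		return false
	}
	// Sync the desired labels onto the Node before lifting the taint.
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	for k, v := range desiredLabels {
		node.Labels[k] = v
	}
	// Drop the uninitialized taint so workloads can be scheduled.
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != uninitializedTaint.Key || t.Effect != uninitializedTaint.Effect {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
	return true
}
```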
Regarding communication to the WC: there is a feature group around alternative communication patterns (I don't know how active it is currently): https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/community/20230111-alternative-communication-patterns.md
There is also a feature group for karpenter: https://hackmd.io/@elmiko/ryR2VXR0n
/triage needs-information
/close
@fabriziopandini: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm not sure if it's better to reopen this or create a new issue, but we noticed similar issues with our CAPI (v1.7.3), CAPZ (v1.5.2 with a few patches in open PRs), and cluster-autoscaler with the clusterapi provider (v1.28.2 with a few patches in open PRs). We're observing this using MachinePools. Looking at metrics using the following prometheus query (replace
most have the taint for 30-60s. However, looking at the top50 by wrapping the query in
I've been investigating the code and some logs a bit to figure out why that might happen in some cases. I've not fully verified this yet, but I hope someone can confirm/reject my suspicion :)
Therefore: a prerequisite for label sync is the providerIDs, which get populated by
Looking at the CAPZ code related to readiness: this means that when we're scaling up or down, the InfraMP is not marked ready:
and in this case the nodes stay uninitialized until the InfraMP is ready again. Now my questions are:
@mweibel: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mweibel reopening |
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
While it sounds reasonable to check and see if this can be improved, have you also been able to find out why it took that long for the underlying VMSS to be in state succeeded and for the replicas to match?
If I understand the CAPZ code correctly, it only adds ready MachinePoolMachines to the count in
It depends on the OS, on the availability of the machines we need, and on the number of machines. We run batch workloads and sometimes scale up from close to zero (or zero) to 100 machines within a short amount of time. If we, let's say, scale up from 0 to 10, and a machine takes 1-5 minutes to be ready (linux machines usually take <1 min, while windows machines average around 5 min):
The worst case above with the 3090s might have been some other issue - I've investigated a bit but will need to investigate some more to figure it out.
Does that additional context help make the issue clearer?
Makes sense; I think it's still worth finding the root cause, though. Changing the above-mentioned behaviour in CAPI would lead to a change of the contract documented here: https://main.cluster-api.sigs.k8s.io/developer/architecture/controllers/machine-pool Especially this part:
If we want to change how this works, we also need to check whether the other providers supporting MachinePools are okay with it, so we don't break things. So getting to the questions:
I think CAPZ is doing fine according to the contract. Depending on which state CAPZ's MachinePool is in, CAPZ could consider setting the ready boolean to true earlier.
I think this is defined in the contract, so yes, CAPI should not continue unless the ready state is set to true; however, it's the Infra Provider's decision when it signals to CAPI that it can continue. Note: I'm not that deep into the machine pool controller and machine pools, so don't consider this a strong opinion. But changing the contract should IMHO be done very, very carefully. Also see the machine pool proposal, which has more detailed diagrams about it.
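For reference, the contract fields under discussion, paraphrased from the MachinePool contract documentation linked above, look roughly like the following Go sketch; the wrapper types are illustrative and not CAPZ's actual API types.

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// InfraMachinePool sketches the fields the MachinePool contract expects an
// infrastructure provider's object to expose. Field names follow the contract
// documentation; the surrounding types are illustrative only.
type InfraMachinePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              InfraMachinePoolSpec   `json:"spec,omitempty"`
	Status            InfraMachinePoolStatus `json:"status,omitempty"`
}

type InfraMachinePoolSpec struct {
	// providerIDList holds the cloud provider IDs of the instances backing
	// the pool; CAPI matches these against Node.spec.providerID to find and
	// initialize the corresponding workload-cluster Nodes.
	ProviderIDList []string `json:"providerIDList,omitempty"`
}

type InfraMachinePoolStatus struct {
	// ready signals that the infrastructure is fully provisioned. CAPI waits
	// for this, which is why a pool that reports ready=false while scaling
	// also delays taint removal for the new nodes.
	Ready bool `json:"ready"`
	// replicas is the most recently observed number of provisioned instances.
	Replicas int32 `json:"replicas"`
}
```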
cc @mboersma
In my opinion, with cabpk this requirement exists independently of the taint. When a kubeadmconfig is created, cabpk creates a bootstrap token secret within the wl cluster. It's unclear to me how this would work without access to the wl cluster.
@AverageMarcus can you please share how you were able to create new nodes without connectivity with cluster-autoscaler and cabpk before we introduced the taint? I'm also curious to know how the cluster-autoscaler is able to scale up a MD in the mgmt cluster if the mgmt cluster is not available (or are we assuming the connection is only broken in the other direction in that scenario?)
Shouldn't the Nodes show up in the workload cluster at some point, and shouldn't that trigger a reconcile of the Machine controller? If it doesn't, then that should be implemented; otherwise, it should be fixed if it's buggy. See:
EDIT: Sorry, I guess Christian was already saying roughly the same above.
Yes, the bootstrap token would still be a blocker if the Management Cluster is down for more than 15 minutes, which I think is too short a period, but that would be for another issue :) And yeah, I guess the issue is that the reconciliation loop can take a bit of time. Would it make sense to only have this taint on the instances created by Machine CRs, but not on the instances created via MachinePools?
I would first like to figure out why the current implementation, which should react immediately, doesn't work (#9858 (comment)). There's also this dependency:
The current situation is that it works, but there is a very big delay, and my main gripe is that I don't see why the taint is needed at all, as the instances are fully operational. I have no experience with MachineDeployments; we only use MachinePools.
FYI, I linked a related CAPZ issue here. I'm continuing to investigate this. @chrischdi FYI:
In our case we use the cluster-autoscaler in clusterapi provider mode.
I have the same feeling, but I also see why it needs to be there. We could probably avoid the taint if the MachinePool template doesn't specify labels - but I don't think this would be a good fix because it hides an underlying issue.
Yeah, I can see why it's useful, but I wonder if we could either make it optional or decrease the reconciliation loop time to make it much faster at detecting new nodes.
Until then I'm -1 on fixes like disabling the feature or reducing the sync period. Just to be clear: with our current implementation we should immediately reconcile after a Node shows up in the workload cluster (this means the taint should be gone within a few milliseconds).
How should it reconcile immediately when the nodes are created by cluster-autoscaler directly talking to the AWS API? Aren't we limited by the sync-period (10 minutes by default) when using cluster-autoscaler deployed in the workload cluster? |
As Stefan pointed out: the controller should watch the nodes in the cluster and get an event when the kubelet creates its node: #9858 (comment)
But that is only executed every "sync-period" when there are no changes to the CRs AFAIK |
A Kubernetes watch does not depend on the sync period. The sync period is just the interval in which the local list => broadcast is executed for all objects currently in the local store.
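As a quick way to see this, here is a minimal client-go sketch (the kubeconfig path is a placeholder): events for new Nodes arrive on the watch as soon as the API server records them, with no polling or sync period involved.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client against the workload cluster (kubeconfig path is just a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/workload-cluster.kubeconfig")
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Open a watch on Node objects. ADDED/MODIFIED events are pushed by the
	// API server as they happen; no polling or sync period is involved.
	w, err := cs.CoreV1().Nodes().Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if node, ok := ev.Object.(*corev1.Node); ok {
			fmt.Printf("%s node %s (providerID=%q)\n", ev.Type, node.Name, node.Spec.ProviderID)
		}
	}
}
```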
Will try to gather logs and events from controllers, but I think this watch is not really triggering the removal of the taint. |
That is exactly what I'm trying to say :). For some reason this watch does not seem to work, and the proper fix is to figure out why.
These events are not the same as the events pushed by the Kubernetes watch API. |
This is the registered event on the workload cluster
None of our controllers are watching on these event objects. They are using the watch API. Please see https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes |
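In controller-runtime terms, the pattern under discussion is a watch on Node objects whose events are mapped into reconcile requests for the owning resource (conceptually what watchClusterNodes / nodeToMachinePool do). Below is a rough sketch; the annotation-based mapping is hypothetical and only keeps the example short, while the real CAPI code resolves the owner via the node's provider ID and watches the remote workload cluster's cache.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// nodeToOwner maps a Node event to a reconcile request for the object that
// owns it. The annotation key here is hypothetical and only used to keep the
// sketch short; CAPI's nodeToMachinePool resolves the owner differently.
func nodeToOwner(_ context.Context, o client.Object) []reconcile.Request {
	node, ok := o.(*corev1.Node)
	if !ok {
		return nil
	}
	owner, found := node.GetAnnotations()["example.io/owner"]
	if !found {
		return nil
	}
	return []reconcile.Request{{
		NamespacedName: client.ObjectKey{Namespace: "default", Name: owner},
	}}
}

// setupNodeWatch wires the mapping into a controller so that a Node appearing
// in the cluster immediately enqueues a reconcile, independent of the resync
// period. CAPI does the equivalent against the workload cluster's cache.
func setupNodeWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("example-node-watcher").
		Watches(&corev1.Node{}, handler.EnqueueRequestsFromMapFunc(nodeToOwner)).
		Complete(r)
}
```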
+1 to investigate why existing watches on Node objects are not working as expected |
Have some additional information here:
New node created (ip-10-0-204-6.eu-west-2.compute.internal, aws:///eu-west-2c/i-088c9f73c6f2ae2dc) at Tue, 16 Jul 2024 12:51:29 +0200.
We can see multiple reconciliation loops:
Finally, at 12:56, it gets detected:
Based on what I can see, it seems like nodeRefs are not being updated soon enough and the code is following this path
I see NodeRefs are updated later in the code in
ProviderIDList is updated in
I also added some logging to the nodeToMachinePool function, which should be called from watchClusterNodes, and it's never called.
The current issue does not seem to be related to the CAPI controller but to the CAPA controller; this function is only run every 10 minutes and is the one that updates the ProviderIDList.
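To illustrate the difference (purely illustrative, not CAPA's actual code): a reconciler that is only driven by a periodic requeue will not notice instances the autoscaler added out-of-band until the next interval, whereas a watch-driven reconciler is woken up by the event itself.

```go
package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// pollingReconciler models the behaviour described above: external state
// (e.g. instances an autoscaler added directly via the cloud API) is only
// observed when Reconcile runs, and without a watch feeding it events the
// next run happens only after RequeueAfter elapses. With a 10-minute
// interval, ProviderIDList can lag by up to 10 minutes, and the new nodes
// keep the uninitialized taint for that long.
type pollingReconciler struct{}

func (r *pollingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. List instances from the cloud API (omitted).
	// 2. Write the observed provider IDs into spec.providerIDList (omitted).
	// 3. Come back later; nothing external wakes us up earlier.
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
}
```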
Great job in triaging this issue!
Based on the evidence above, is it OK to close this issue (you should probably open a new one in CAPA)?
(I think it is a yes) |
@fabriziopandini: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This CAPA problem is basically the same issue: kubernetes-sigs/cluster-api-provider-aws#4618 |
What steps did you take and what happened?
From reading the docs, the bootstrap provider can optionally taint worker nodes at creation with node.cluster.x-k8s.io/uninitialized:NoSchedule. We noticed that CAPI is taking up to 5 minutes to remove the taint from nodes. Unfortunately this is marked as optional, but there is no flag to omit it; from observing the code, we cannot see any place where this is possible.
We don't think this is intended, especially when MachinePools are managed externally by cluster-autoscaler.
Additionally, there are other provider-specific controllers, like aws-cloud-controller-manager, which set a similar taint node.cloudprovider.kubernetes.io/uninitialized while waiting for nodes to become ready.
What did you expect to happen?
Ensure taint configuration in bootstrap provider is optional.
Cluster API version
1.4.7
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
/area bootstrap