Nodes bootstrapped via aws-hosted-cp get stuck with node.cloudprovider.kubernetes.io/uninitialized taint #290

Closed
squizzi opened this issue Sep 10, 2024 · 4 comments · Fixed by #280


squizzi commented Sep 10, 2024

While testing aws-hosted-cp for our e2e test work in #280, I found that nodes deployed using the aws-hosted-cp template cannot use CCM resources because they get stuck with the following taint:

    taints:
    - effect: NoSchedule
      key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"

Relates to: kubernetes-sigs/cluster-api#9858, kubernetes-sigs/cluster-api-provider-aws#4618

Stripping the taint allows the test to progress.
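
As a rough illustration of what stripping the taint amounts to, here is a minimal Python sketch that filters it out of a node's `spec.taints` list. The function name and the second example taint are hypothetical, and writing the result back to the cluster is deliberately left out. This is only a workaround: the cloud controller manager normally removes this taint itself once it has initialized the node.

```python
# Taint key taken from the issue report above.
UNINITIALIZED_TAINT_KEY = "node.cloudprovider.kubernetes.io/uninitialized"


def strip_uninitialized_taint(taints):
    """Return a new list of taints without the cloud-provider uninitialized taint.

    `taints` is the list found under a node's spec.taints (or None if absent).
    """
    return [t for t in (taints or []) if t.get("key") != UNINITIALIZED_TAINT_KEY]


# Example input: the taint from the report plus a hypothetical unrelated taint.
node_taints = [
    {"effect": "NoSchedule", "key": UNINITIALIZED_TAINT_KEY, "value": "true"},
    {"effect": "NoSchedule", "key": "example.com/dedicated", "value": "infra"},
]
remaining = strip_uninitialized_taint(node_taints)
```

Only the unrelated taint is left in `remaining`; the input list itself is not modified.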


squizzi commented Sep 10, 2024

The LoadBalancer won't get an external IP assigned either, and stripping the taint has no effect on this. Because CCM isn't running, konnectivity won't start up and fails with:

    konnectivity-agent-l4m8m                   0/1     CreateContainerConfigError   0          88s
    konnectivity-agent-v9lpb                   0/1     CreateContainerConfigError   0          88s

    ...

      Warning  Failed     2s (x5 over 28s)  kubelet            Error: host IP unknown; known addresses: []
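
The empty `known addresses: []` in the event above is consistent with the nodes never having been initialized by the CCM. A minimal Python sketch for spotting such nodes follows, assuming the input is the parsed output of `kubectl get nodes -o json`. The function name and the sample node names are hypothetical, and the sketch only inspects data; it does not talk to a cluster.

```python
UNINITIALIZED_TAINT_KEY = "node.cloudprovider.kubernetes.io/uninitialized"


def find_uninitialized_nodes(node_list):
    """Map node name -> list of symptoms for nodes that look uninitialized.

    `node_list` is a dict shaped like the JSON from `kubectl get nodes -o json`.
    """
    problems = {}
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name", "<unknown>")
        taints = node.get("spec", {}).get("taints") or []
        addresses = node.get("status", {}).get("addresses") or []
        symptoms = []
        if any(t.get("key") == UNINITIALIZED_TAINT_KEY for t in taints):
            symptoms.append("uninitialized taint present")
        if not addresses:
            symptoms.append("no addresses reported")
        if symptoms:
            problems[name] = symptoms
    return problems


# Hypothetical sample: one stuck node and one healthy node.
sample = {
    "items": [
        {
            "metadata": {"name": "worker-a"},
            "spec": {"taints": [{"key": UNINITIALIZED_TAINT_KEY,
                                 "value": "true", "effect": "NoSchedule"}]},
            "status": {},
        },
        {
            "metadata": {"name": "worker-b"},
            "spec": {},
            "status": {"addresses": [{"type": "InternalIP", "address": "10.0.0.5"}]},
        },
    ]
}
report = find_uninitialized_nodes(sample)
```

In this sample only `worker-a` is reported, with both symptoms listed.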


squizzi commented Sep 10, 2024

This seems related to CCM not running correctly on workload clusters due to some form of permissions issue. Looking into it further.


squizzi commented Sep 10, 2024

For whatever reason this is intermittent: when I tried to debug it on another run, everything went fine.

@DinaBelova DinaBelova added the bug Something isn't working label Sep 12, 2024
@DinaBelova DinaBelova moved this to Todo in Project 2A Sep 12, 2024
@squizzi squizzi self-assigned this Sep 12, 2024

squizzi commented Sep 12, 2024

I can reproduce this pretty reliably now. I'm going to repro it outside of CI and see if I can debug it and get to the bottom of it.

@squizzi squizzi moved this from Todo to In Progress in Project 2A Sep 12, 2024
squizzi added a commit that referenced this issue Sep 13, 2024
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Delete csi-driver, ccm validation tests from aws-hosted-cp test
  until #290 is resolved so that we don't get stuck there and can
  properly test deletion.

Closes: #152

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 14, 2024
* Break KubeClient helpers into provider specific file
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into separate job so that it is unaffected by
  concurrency group cancellations.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 14, 2024
* Break KubeClient helpers into provider specific file
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into separate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 16, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into separate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.

Closes: #212

Signed-off-by: Kyle Squizzato <[email protected]>
squizzi added a commit that referenced this issue Sep 16, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: #152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: #290)
* Break cleanup into separate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.
* Only build linux/amd64 arch since CI does not need arm.

Signed-off-by: Kyle Squizzato <[email protected]>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Project 2A Sep 17, 2024
bnallapeta pushed a commit to bnallapeta/hmc that referenced this issue Nov 15, 2024
* Break KubeClient helpers into provider specific file.
* Try to simplify the validation process for lots of different providers
  with different requirements.
* Finish aws-hosted-cp test and add comments through test to make it
  easier to understand.
* Use GinkgoHelper across e2e tests, populate hosted vars from
  AWSCluster.
* No longer rely on local registry for images in test/e2e.
* Support OS for awscli install.
* Prepend hostname to collected log artifacts.
* Support no cleanup of provider specs, differentiate ci
  cluster names.
* Add docs on running tests, do not wait for all providers
  if configured.
* Reinstantiate resource validation map on each instance of
  validation.
* Enable the external-gc feature via annotation, featureGate
  bool. (Closes: Mirantis#152)
* Bump aws-*-cp templates to 0.1.3
* Bump cluster-api-provider-aws template to 0.1.2
* Improve test logging to log template name and validation
  phase.
* Bump k0s version to v1.30.4+k0s.0, set CCM nodeSelector to
  null for aws-hosted-cp. (Closes: Mirantis#290)
* Break cleanup into separate job so that it is unaffected by
  concurrency group cancellations.
* Make dev-aws-nuke target less PHONY.
* Only build linux/amd64 arch since CI does not need arm.

Signed-off-by: Kyle Squizzato <[email protected]>