
The k8ssandra-operator mutating webhook should be restricted to cass-operator managed pods #1172

Closed
vcanuel opened this issue Jan 13, 2024 · 4 comments · Fixed by #1173
Labels: bug (Something isn't working), done (Issues in the state 'done')


vcanuel commented Jan 13, 2024

What happened?

After an automatic upgrade of my Kubernetes cluster on Google Cloud Platform (GCP), I encountered connectivity issues with the k8ssandra-operator-webhook-service. This caused numerous deployment failures, including the metrics-server, and led to significant instability in my cluster. I observed the following error messages for every deployment in the cluster:

Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://k8ssandra-operator-webhook-service.k8ssandra-operator.svc:443/mutate-v1-pod-secrets-inject?timeout=10s": no endpoints available for service "k8ssandra-operator-webhook-service"

Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://k8ssandra-operator-webhook-service.k8ssandra-operator.svc:443/mutate-v1-pod-secrets-inject?timeout=10s": No agent available

Despite restarting the operator, the problem persisted.
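For reference, the "no endpoints available" error indicates that the webhook Service has no ready backing pods. A quick way to confirm this (a sketch, assuming the operator runs in the k8ssandra-operator namespace shown in the error message above):

kubectl get endpoints k8ssandra-operator-webhook-service -n k8ssandra-operator
kubectl get pods -n k8ssandra-operator

If the Endpoints object lists no addresses, the operator pod is not running or not ready, which matches the behavior described above.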

Did you expect to see something different?

Yes, post-upgrade, I expected the cluster to remain stable with all services functioning correctly, including the webhook service. Essential deployments, especially the metrics-server, were expected to launch without issues.

How to reproduce it (as minimally and precisely as possible):

The Kubernetes cluster undergoes an automatic upgrade on GCP (in my case from 1.27.3-gke.100 to 1.27.7-gke.1056000).
Post-upgrade, observe the behavior of k8ssandra-operator-webhook-service and the launching of deployments.

Environment

  • K8ssandra Operator version:
    1.11.0

  • Kubernetes version information:
    Client Version: v1.29.0
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.27.7-gke.1056000

  • Kubernetes cluster kind:
    Google Cloud Platform (GCP) managed Kubernetes cluster.

  • Manifests:

  • K8ssandra Operator Logs:

Anything else I need to know?:

The issue has led to a large number of deployment failures and significantly impacted the stability of my Kubernetes environment. I am seeking insights or guidance on resolving these post-upgrade issues.

@vcanuel vcanuel added the bug Something isn't working label Jan 13, 2024
adejanovski (Contributor) commented:

Hi @vcanuel, very sorry about this. You can resolve this by manually deleting the mutating webhook so that the cluster can recover.

You should have something like this:

 % kubectl get MutatingWebhookConfiguration
NAME                                                       WEBHOOKS   AGE
cert-manager-webhook                                       1          204d
k8ssandra-operator-mutating-webhook-configuration          1          204d
neg-annotation.config.common-webhooks.networking.gke.io    1          204d
pod-ready.config.common-webhooks.networking.gke.io         1          204d
warden-mutating.config.common-webhooks.networking.gke.io   1          198d

And then you can delete the k8ssandra-operator mutating webhook with:

kubectl delete MutatingWebhookConfiguration/k8ssandra-operator-mutating-webhook-configuration

Re-installing the operator once the cluster is back up and running should recreate the webhook.
We need to investigate what causes this behavior and how we can prevent it from happening in the future.

adejanovski (Contributor) commented:

I think the fix for us should be to change the failure policy from Fail to Ignore.
@burmanm, wdyt?

@vcanuel, you can probably try this yourself by editing the webhook instead of deleting it.
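As a sketch of that edit (assuming the configuration contains a single webhook entry, as the WEBHOOKS column in the listing above suggests), the failure policy could be switched with a patch instead of deleting the whole configuration:

kubectl patch mutatingwebhookconfiguration k8ssandra-operator-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

With failurePolicy set to Ignore, an unreachable webhook no longer blocks unrelated pod creation, at the cost of skipping the secret injection while the operator is down.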

@adejanovski adejanovski moved this to Assess/Investigate in K8ssandra Jan 13, 2024
@adejanovski adejanovski added the assess Issues in the state 'assess' label Jan 13, 2024
vcanuel (Author) commented Jan 13, 2024

Thanks for your quick response.

I have restored the cluster from an earlier snapshot, as this occurred in our production environment. I will keep your advice as a reference in case this issue arises again. There's still a lot for me to learn about Kubernetes :).

adejanovski (Contributor) commented:

We'll have that fixed in our next release, which is planned for the beginning of February at the latest.
So after the next upgrade you shouldn't run into that issue at all.
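For reference, the restriction described in the updated issue title amounts to scoping the webhook with an objectSelector so it is only invoked for cass-operator managed pods. The command below is only a sketch: the app.kubernetes.io/managed-by: cass-operator label is an assumption about how cass-operator labels the pods it manages, and the actual change ships with the operator release, so it should not need to be applied by hand:

kubectl patch mutatingwebhookconfiguration k8ssandra-operator-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "add", "path": "/webhooks/0/objectSelector", "value": {"matchLabels": {"app.kubernetes.io/managed-by": "cass-operator"}}}]'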

@adejanovski adejanovski self-assigned this Jan 15, 2024
@adejanovski adejanovski moved this from Assess/Investigate to In Progress in K8ssandra Jan 15, 2024
@adejanovski adejanovski added in-progress Issues in the state 'in-progress' and removed assess Issues in the state 'assess' labels Jan 15, 2024
@adejanovski adejanovski changed the title Automatic upgrade to GCP cluster leads to k8ssandra-operator-webhook-service failure and cluster breakdown The k8ssandra-operator mutating webhook should be restricted to cass-operator managed pods Jan 15, 2024
@adejanovski adejanovski moved this from In Progress to Ready For Review in K8ssandra Jan 15, 2024
@adejanovski adejanovski added ready-for-review Issues in the state 'ready-for-review' and removed in-progress Issues in the state 'in-progress' labels Jan 15, 2024
@github-project-automation github-project-automation bot moved this from Ready For Review to Done in K8ssandra Jan 16, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed ready-for-review Issues in the state 'ready-for-review' labels Jan 16, 2024