
The k8ssandra-operator mutating webhook should be restricted to cass-operator managed pods #1172

Closed
vcanuel opened this issue Jan 13, 2024 · 4 comments · Fixed by #1173
Labels: bug (Something isn't working), done (Issues in the state 'done')


vcanuel commented Jan 13, 2024

What happened?

After an automatic upgrade of my Kubernetes cluster on Google Cloud Platform (GCP), I encountered connectivity issues with the k8ssandra-operator-webhook-service. This caused numerous deployment failures, including the metrics-server, and led to significant instability in my cluster. I observed the following error messages for every deployment in the cluster:

Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://k8ssandra-operator-webhook-service.k8ssandra-operator.svc:443/mutate-v1-pod-secrets-inject?timeout=10s": no endpoints available for service "k8ssandra-operator-webhook-service"

Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://k8ssandra-operator-webhook-service.k8ssandra-operator.svc:443/mutate-v1-pod-secrets-inject?timeout=10s": No agent available

Despite restarting the operator, the problem persisted.
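For reference, the "no endpoints available" error indicates that the webhook Service has no ready backing pods. A quick way to confirm this (a sketch, assuming the operator runs in the k8ssandra-operator namespace shown in the error message above):

kubectl get endpoints k8ssandra-operator-webhook-service -n k8ssandra-operator
kubectl get pods -n k8ssandra-operator

If the Endpoints object lists no addresses, the operator pod is not running or not ready, which matches the behavior described above.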

Did you expect to see something different?

Yes, post-upgrade, I expected the cluster to remain stable with all services functioning correctly, including the webhook service. Essential deployments, especially the metrics-server, were expected to launch without issues.

How to reproduce it (as minimally and precisely as possible):

The Kubernetes cluster undergoes an automatic upgrade on GCP (in my case from 1.27.3-gke.100 to 1.27.7-gke.1056000).
Post-upgrade, observe the behavior of k8ssandra-operator-webhook-service and the launching of deployments.

Environment

  • K8ssandra Operator version:
    1.11.0

  • Kubernetes version information:
    Client Version: v1.29.0
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.27.7-gke.1056000

  • Kubernetes cluster kind:
    Google Cloud Platform (GCP) managed Kubernetes cluster.

  • Manifests:

  • K8ssandra Operator Logs:

Anything else I need to know?:

The issue has led to a large number of deployment failures and significantly impacted the stability of my Kubernetes environment. I am seeking insights or guidance on resolving these post-upgrade issues.

@vcanuel vcanuel added the bug Something isn't working label Jan 13, 2024
adejanovski (Contributor) commented:

Hi @vcanuel, very sorry about this. You can resolve this by manually deleting the mutating webhook so that the cluster can recover.

You should have something like this:

 % kubectl get MutatingWebhookConfiguration
NAME                                                       WEBHOOKS   AGE
cert-manager-webhook                                       1          204d
k8ssandra-operator-mutating-webhook-configuration          1          204d
neg-annotation.config.common-webhooks.networking.gke.io    1          204d
pod-ready.config.common-webhooks.networking.gke.io         1          204d
warden-mutating.config.common-webhooks.networking.gke.io   1          198d

And then you can delete the k8ssandra-operator mutating webhook with:

kubectl delete MutatingWebhookConfiguration/k8ssandra-operator-mutating-webhook-configuration

Re-installing the operator once the cluster is back up and running should recreate the webhook.
We need to investigate what causes this behavior and how we can prevent it from happening in the future.

adejanovski (Contributor) commented:

I think the fix for us should be to change the failure policy from Fail to Ignore.
@burmanm, wdyt?

@vcanuel, you can probably try this yourself by editing the webhook instead of deleting it.
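As a sketch of that edit (assuming the configuration contains a single webhook entry, as the WEBHOOKS column in the listing above suggests), the failure policy could be switched with a patch instead of deleting the whole configuration:

kubectl patch mutatingwebhookconfiguration k8ssandra-operator-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

With failurePolicy set to Ignore, an unreachable webhook no longer blocks unrelated pod creation, at the cost of skipping the secret injection while the operator is down.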

@adejanovski adejanovski moved this to Assess/Investigate in K8ssandra Jan 13, 2024
@adejanovski adejanovski added the assess Issues in the state 'assess' label Jan 13, 2024
vcanuel (Author) commented Jan 13, 2024

Thanks for your quick response.

I have restored the cluster from an earlier snapshot, as this occurred in our production environment. I will keep your advice as a reference in case this issue arises again. There's still a lot for me to learn about Kubernetes :).

adejanovski (Contributor) commented:

We'll have that fixed in our next release, which is planned for the beginning of February at the latest.
So after the next upgrade you shouldn't run into that issue at all.
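For reference, the restriction described in the updated issue title amounts to scoping the webhook with an objectSelector so it is only invoked for cass-operator managed pods. The command below is only a sketch: the app.kubernetes.io/managed-by: cass-operator label is an assumption about how cass-operator labels the pods it manages, and the actual change ships with the operator release, so it should not need to be applied by hand:

kubectl patch mutatingwebhookconfiguration k8ssandra-operator-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "add", "path": "/webhooks/0/objectSelector", "value": {"matchLabels": {"app.kubernetes.io/managed-by": "cass-operator"}}}]'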

@adejanovski adejanovski self-assigned this Jan 15, 2024
@adejanovski adejanovski moved this from Assess/Investigate to In Progress in K8ssandra Jan 15, 2024
@adejanovski adejanovski added in-progress Issues in the state 'in-progress' and removed assess Issues in the state 'assess' labels Jan 15, 2024
@adejanovski adejanovski changed the title Automatic upgrade to GCP cluster leads to k8ssandra-operator-webhook-service failure and cluster breakdown The k8ssandra-operator mutating webhook should be restricted to cass-operator managed pods Jan 15, 2024
@adejanovski adejanovski moved this from In Progress to Ready For Review in K8ssandra Jan 15, 2024
@adejanovski adejanovski added ready-for-review Issues in the state 'ready-for-review' and removed in-progress Issues in the state 'in-progress' labels Jan 15, 2024
@github-project-automation github-project-automation bot moved this from Ready For Review to Done in K8ssandra Jan 16, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed ready-for-review Issues in the state 'ready-for-review' labels Jan 16, 2024