bug: removal of VM k8s chart does not work well. #1563

Open
gecube opened this issue Oct 7, 2024 · 8 comments · Fixed by #1570

Comments

@gecube

gecube commented Oct 7, 2024

Good day!

The issue is observed on all versions of the victoria-metrics-k8s-stack chart.

The steps to reproduce:

  1. install victoria-metrics-k8s-stack
  2. remove the namespace where victoria-metrics-k8s-stack is installed
  3. since the deletion order of objects in the namespace is not defined, there is a chance that the operator is removed before the resources it manages, effectively leaving the namespace stuck in the Terminating state (see the sketch below)
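
A rough sketch of what a namespace stuck this way typically reports (the field values and the exact finalizer name are illustrative assumptions, not taken from a real cluster):

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring-system
  deletionTimestamp: "2024-10-07T10:00:00Z"   # deletion was requested
status:
  phase: Terminating
  conditions:
    # the namespace controller cannot finish because some objects still carry finalizers
    - type: NamespaceFinalizersRemaining
      status: "True"
      message: 'Some content in the namespace has finalizers remaining: apps.victoriametrics.com/finalizer in 4 resource instances'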

Possible solutions:

  • do not recommend installing the operator and the CRs it manages in the same namespace
  • split the operator chart and the CRs into different charts and/or namespaces and recommend such an installation

I also got the same result when I tried to remove the victoria-metrics-k8s-stack chart without deleting the namespace.

  1. install victoria-metrics-k8s-stack with FluxCD
  2. remove the HelmRelease from the GitOps repo and/or invoke kubectl delete helmrelease -n monitoring-system victoria-metrics-k8s-stack
  3. the pods get stuck:
kubectl get pods -A
...
monitoring-system   vmagent-victoria-metrics-k8s-stack-5d4d45898d-47qwx    2/2     Running            2 (42m ago)   4d19h
monitoring-system   vmalert-victoria-metrics-k8s-stack-846689d7db-r6dh6    2/2     Running            2 (42m ago)   4d19h
monitoring-system   vmalertmanager-victoria-metrics-k8s-stack-0            0/2     Unknown            0             4d22h
monitoring-system   vmsingle-victoria-metrics-k8s-stack-7565649d7c-7v624   1/1     Running            1 (42m ago)   4d19h

From the above I conclude that the chart's removal process was not tested and is subject to race conditions (if the operator is accidentally removed BEFORE the CRs it manages, they get stuck).

I kindly ask you to investigate and confirm the bug.

@AndrewChubatiuk
Collaborator

hey @gecube
this behaviour is expected: CRs should not be removed together with the operator. To remove CRs, the operator chart has a delete hook which removes them (the victoria-metrics-operator.crd.cleanup.enabled option in the k8s-stack chart), but I have no idea how it works with Flux CD. Regarding deploying CRs in another namespace, you can always set victoria-metrics-operator.namespaceOverride to run the operator in a separate namespace.
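
A minimal k8s-stack values sketch combining both suggestions (the key paths follow the option names mentioned above; the exact structure may differ between chart versions):

# k8s-stack values.yaml (sketch)
victoria-metrics-operator:
  # run the operator in its own namespace instead of the release namespace
  namespaceOverride: victoria-metrics-operator
  crd:
    cleanup:
      # pre-delete hook that removes VM CRs during helm uninstall
      enabled: true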

@gecube
Author

gecube commented Oct 9, 2024

@AndrewChubatiuk there is a difference between a CRD and a CR, so let's not confuse them.

A CRD is like an API extension. It's OK if they are not deleted, as the user may have installed something else besides the VM stack.
A CR is an instantiation of a CRD, a particular object in the cluster. And what I see is that with the default installation there are race issues. I was able to reproduce it several times (~5) with different test cases: complete removal of the namespace (one situation) and removal of the k8s-stack chart (another situation). Both situations lead to orphaned resources, as I demonstrated.

As a user of the chart, I expect that removing the chart (or even the namespace) will remove all objects and leave the CRDs untouched.

@gecube
Author

gecube commented Oct 9, 2024

BTW, in production I effectively fixed this issue by not installing the k8s-stack chart with all options enabled, but by installing three charts into separate namespaces:

  • operator
  • grafana
  • vm k8s stack with grafana and operator disabled

I tested it, and it gives the DevOps team more control over installation/uninstallation and day-2 operations. There are some pitfalls, but clean-up works well.

@AndrewChubatiuk
Collaborator

Resources managed by CRs cannot be removed if the operator has been removed, as they carry the operator's finalizer. Operator removal cannot be a trigger for CR removal, since it can happen for many reasons, e.g. deployment spec changes. Ordering of resource removal is not available in Helm at the moment. So for now you can set a custom victoria-metrics-operator.namespaceOverride value to deploy the operator in a separate namespace. In this case there should be no issues with the helm delete hook, which removes CRs during helm uninstall.
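
For illustration, a CR stuck after the operator is gone typically looks roughly like this (the kind, names, and finalizer string are assumptions for the sketch; the point is the finalizer that nothing will ever remove):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMSingle
metadata:
  name: victoria-metrics-k8s-stack
  namespace: monitoring-system
  deletionTimestamp: "2024-10-07T10:00:00Z"   # deletion requested by helm/flux
  finalizers:
    - apps.victoriametrics.com/finalizer       # only the (now deleted) operator would remove this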

@gecube
Author

gecube commented Oct 9, 2024

I think a preStop hook and terminationGracePeriodSeconds could help. If we added them to the operator pods, the operator would not be deleted instantly and would have some chance to clean up the CRs. Unfortunately, it is not a robust solution, as there is no strict ordering in Helm...
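
A minimal sketch of the idea at the pod-spec level (the values are illustrative; this is not chart configuration):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vm-operator
spec:
  selector:
    matchLabels:
      app: vm-operator
  template:
    metadata:
      labels:
        app: vm-operator
    spec:
      # give the operator extra time before the kubelet force-kills it
      terminationGracePeriodSeconds: 120
      containers:
        - name: operator
          image: victoriametrics/operator   # tag omitted in this sketch
          lifecycle:
            preStop:
              exec:
                # crude delay so the operator keeps running while CRs are being deleted
                command: ["/bin/sh", "-c", "sleep 60"]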

for now you can set a custom victoria-metrics-operator.namespaceOverride

Thanks for the advice. But it still sounds like a placebo :-)

@Haleygo
Contributor

Haleygo commented Oct 10, 2024

@gecube Thanks for the issue!
I agree there is no way to fix the issue when deleting both the operator and the VM CRs by deleting the namespace directly (and as I understand it, that is not often done and not recommended). The proposal of adding a preStop hook and terminationGracePeriodSeconds to delay the operator pod is indeed unreliable.

Also I got the same result when I tried to remove the victoria-metrics-k8s-stack chart without ns deletion.

  1. install victoria-metrics-k8s-stack with FluxCD
  2. remove HelmRelease from gitops repo and/or invoke kubectl delete helmrelease -n monitoring-system victoria-metrics-k8s-stack
  3. get pods stuck
    kubectl get pods -A
    ...
    monitoring-system vmagent-victoria-metrics-k8s-stack-5d4d45898d-47qwx 2/2 Running 2 (42m ago) 4d19h
    monitoring-system vmalert-victoria-metrics-k8s-stack-846689d7db-r6dh6 2/2 Running 2 (42m ago) 4d19h
    monitoring-system vmalertmanager-victoria-metrics-k8s-stack-0 0/2 Unknown 0 4d22h
    monitoring-system vmsingle-victoria-metrics-k8s-stack-7565649d7c-7v624 1/1 Running 1 (42m ago) 4d19h

But it's not expected to have problems when uninstalling the k8s-stack release with the operator cleanup hook enabled, since the VM CRs should be deleted by the vm-operator cleanup job before the operator pod gets deleted, so nothing should be blocked.
From the kubectl get pods -A output you provided, I don't see the operator or the cleanup-hook pod. Could you tell us the version of the k8s-stack chart you're testing and confirm whether you've disabled victoria-metrics-operator.crd.cleanup.enabled?

@AndrewChubatiuk
Collaborator

Added the ability to configure lifecycle and terminationGracePeriodSeconds in operator chart 0.35.3 and k8s-stack 0.27.1.
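
Assuming the new values mirror the Kubernetes field names (the exact keys should be checked against the chart's values.yaml for those versions), the operator section of the k8s-stack values could look roughly like:

victoria-metrics-operator:
  # extra time before the operator pod is force-killed
  terminationGracePeriodSeconds: 120
  lifecycle:
    preStop:
      exec:
        # keep the operator alive briefly so in-flight CR deletions can finish
        command: ["/bin/sh", "-c", "sleep 60"]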

@gecube
Author

gecube commented Nov 8, 2024

I faced issues with the removal again.

I am a stupid guy.

monitoring-system   vmagent-victoria-metrics-k8s-stack-7bcf958cc-g6xrb    2/2     Running     0          2d20h
monitoring-system   vmalert-victoria-metrics-k8s-stack-5878997d95-drqw7   2/2     Running     0          2d20h
monitoring-system   vmalertmanager-victoria-metrics-k8s-stack-0           2/2     Running     0          7d2h
monitoring-system   vmsingle-victoria-metrics-k8s-stack-68d89897b-m6t2h   0/1     Pending     0          2d20h

Everything is stuck after I accidentally removed the operator.

I have the following repo structure:

(screenshot of the repository structure, 2024-11-08)

The helmrelease.yaml installs the operator and VM as separate HelmReleases:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: victoria-metrics-operator
  namespace: victoria-metrics-operator
spec:
  chart:
    spec:
      chart: victoria-metrics-operator
      sourceRef:
        kind: HelmRepository
        name: vm
      version: '*'
  interval: 1m0s
  values:
    crd:
      create: false
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: victoria-metrics-k8s-stack
  namespace: monitoring-system
spec:
  chart:
    spec:
      chart: victoria-metrics-k8s-stack
      sourceRef:
        kind: HelmRepository
        name: vm
      version: '*'
  interval: 1m0s
  values:
    kube-state-metrics:
      enabled: true
      extraArgs:
      - --metric-labels-allowlist=pods=[*]
    victoria-metrics-operator:
      enabled: false
    vmsingle:
      enabled: true
      spec:
...

I commented it out in kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - helmrepository.yaml
#  - helmrelease.yaml
configMapGenerator:
  - name: kube-state-metrics-config
    namespace: monitoring-system
    files:
      - kube-state-metrics-config.yaml
    options:
      labels:
        app.kubernetes.io/part-of: flux
        app.kubernetes.io/component: monitoring
        kustomize.toolkit.fluxcd.io/substitute: disabled
configurations:
  - kustomizeconfig.yaml

and everything broke... It would be very interesting to know whether there is some way to prevent this. Anyway, I completely understand that I was not careful and did not foresee such consequences, so it's my fault this time. But there were no "guardrails" or anything else that would stop me from shooting myself in the foot.
