Add a way to temporarily prevent node deletion a.k.a Freeze machine #818

Open
etiennnr opened this issue May 25, 2023 · 1 comment
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) needs/planning Needs (more) planning with other MCM maintainers priority/3 Priority (lower number equals higher priority)

Comments

etiennnr commented May 25, 2023

How to categorize this issue?

/area quality robustness
/kind enhancement
/priority 3

What would you like to be added:
A way to temporarily prevent a node from being deleted. For example, when we cordon/drain a node to investigate it, it is sometimes deleted automatically because it is not healthy. It would be really useful to be able to keep a node alive so that we can investigate it and find the root cause of a given problem.

It could be something like an annotation added to the node resource (ideally not the machine, since shoot owners might also find this useful). I also think there should be a second annotation with something like a timeout threshold (which can be increased if need be), so that people do not forget a node in that state.

**Update: 2 Aug meeting with Etienne**

Investigation would be needed in the following phases:

  • Pending (cases where the machine is not joining)
  • Unknown machine (pods not working, so cordon/drain the node and then inspect it)
  • Running machine (pods not working, but the machine is Running, probably because the issue could not be tracked through a node condition)

Terminating machines WON'T need any investigation, as their resources are in the deletion phase and could already have been partly deleted by the time the machine is marked to be ignored for deletion.
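
The phase rule above could be expressed as a small predicate. This is a sketch only: the `machinePhase` type and constants are defined locally for illustration and cover just the four phases mentioned here, rather than importing the real MCM API types.

```go
package main

import "fmt"

// machinePhase is a local stand-in for the machine phase type.
type machinePhase string

const (
	phasePending     machinePhase = "Pending"
	phaseUnknown     machinePhase = "Unknown"
	phaseRunning     machinePhase = "Running"
	phaseTerminating machinePhase = "Terminating"
)

// freezeAllowed returns true for the phases listed as worth investigating.
// Terminating (and anything not listed) is rejected.
func freezeAllowed(p machinePhase) bool {
	switch p {
	case phasePending, phaseUnknown, phaseRunning:
		return true
	default:
		return false
	}
}

func main() {
	phases := []machinePhase{phasePending, phaseUnknown, phaseRunning, phaseTerminating}
	for _, p := range phases {
		fmt.Printf("%-12s freezeAllowed=%t\n", p, freezeAllowed(p))
	}
}
```

Rejecting unlisted phases by default is a deliberate choice in this sketch: a phase the discussion did not cover should not silently become freezable.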

Why is this needed:
This would be useful for troubleshooting nodes that suddenly stop working as expected (for RCA purposes).

@etiennnr etiennnr added the kind/enhancement Enhancement, improvement, extension label May 25, 2023
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related priority/3 Priority (lower number equals higher priority) labels May 25, 2023
@himanshu-kun himanshu-kun changed the title Add a way to temporarily prevent node deletion Add a way to temporarily prevent node deletion a.k.a Freeze machine Aug 22, 2023
@rishabh-11
Contributor

Post Grooming Decision:-

The annotation will have a timer. The machine will be deleted after the timer expires. Setting the annotation during rolling update is allowed. If the rolling update is cancelled/paused (option not yet available), the machine will still be considered frozen until the annotation is removed. We won't drain machines before freezing. No option to unfreeze the machine will be made available. Once the timer expires, the machine will be terminated.

Two options:-

  1. Have a separate machine deployment per worker pool dedicated to hosting frozen machines. This will not have a corresponding node group. It will not be a part of the rolling update. CA won't play a part in this, as no node group will be associated with the special machine deployment.
  2. Include this machine in the machine deployment replica count and suspend any lifecycle operations on it. This may cause the rolling update to be blocked. In this approach, CA will have to be adapted to ignore the frozen machines that are part of this machine deployment.

We need to check the code to figure out which option is more feasible.
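
As a rough illustration of what option 2 implies for replica accounting, the following Go sketch separates frozen machines from those still subject to lifecycle operations. Everything here is assumed for the example: the `machine` struct and its `Frozen` field are hypothetical stand-ins, not MCM types, and the real controller and CA logic would be considerably more involved.

```go
package main

import "fmt"

// machine is a simplified, hypothetical stand-in for an MCM Machine object.
type machine struct {
	Name   string
	Frozen bool // assumed to be derived from a freeze annotation
}

// partitionFrozen splits machines into those still subject to lifecycle
// operations (active) and those that are frozen.
func partitionFrozen(machines []machine) (active, frozen []machine) {
	for _, m := range machines {
		if m.Frozen {
			frozen = append(frozen, m)
		} else {
			active = append(active, m)
		}
	}
	return active, frozen
}

func main() {
	machines := []machine{
		{Name: "worker-a"},
		{Name: "worker-b", Frozen: true},
		{Name: "worker-c"},
	}
	active, frozen := partitionFrozen(machines)
	// The frozen machine still counts towards the replica count, but only
	// the active ones would be considered for rolling updates or by CA.
	fmt.Printf("replicas=%d active=%d frozen=%d\n", len(machines), len(active), len(frozen))
}
```

Under option 1, by contrast, the frozen machine would move to a separate machine deployment, so no such partitioning inside a single deployment would be needed.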

@rishabh-11 rishabh-11 added the needs/planning Needs (more) planning with other MCM maintainers label Jan 11, 2024
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Sep 20, 2024