Add a way to temporarily prevent node deletion a.k.a Freeze machine #818
Labels
area/quality
Output qualification (tests, checks, scans, automation in general, etc.) related
area/robustness
Robustness, reliability, resilience related
kind/enhancement
Enhancement, improvement, extension
lifecycle/stale
Nobody worked on this for 6 months (will further age)
needs/planning
Needs (more) planning with other MCM maintainers
priority/3
Priority (lower number equals higher priority)
How to categorize this issue?
/area quality robustness
/kind enhancement
/priority 3
What would you like to be added:
A way to temporarily prevent a node from being deleted. For example, when we cordon/drain a node to investigate it, it sometimes gets deleted automatically because it is not healthy. It would be really useful to be able to keep a node alive to investigate it and find the root cause of a given problem.
It could be something like an annotation added to a node resource (ideally not machine, since shoot owners might also find this useful). I also think there should be a second annotation carrying something like a timeout threshold (which can be increased if need be), so that people do not forget a node left in that state.
**Update: 2 Aug meeting with Etienne**
Investigation would be needed in the following phases:
Pending (cases where the machine is not joining)
Unknown machine (pods not working, so cordon/drain the node and then inspect)
Running machine (pods not working, but the machine is Running, probably because the issue couldn't be tracked through a node condition)
Terminating machines WON'T need any investigation, as the resources are in the deletion phase and could already have been partly deleted by the time the machine is marked to be ignored from deletion.
Why is this needed:
This would be useful for troubleshooting nodes that suddenly stop working as expected (for RCA purposes).