Parallelize the eviction of pods with volumes #848

Open
timuthy opened this issue Sep 8, 2023 · 4 comments
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) priority/3 Priority (lower number equals higher priority)

Comments

@timuthy
Member

timuthy commented Sep 8, 2023

How to categorize this issue?

/area performance
/kind enhancement
/priority 3

What would you like to be added:
MCM should provide a knob to configure the degree of parallel evictions for pods with volumes.

Why is this needed:
#262 established a serial eviction of pods with volumes to make the overall node drain process faster, esp. for cloud providers where many parallel detach/attach operations lead to rate limits and huge back-offs.

On some infrastructures, a limited degree of parallel eviction for pods with volumes could noticeably speed up node drains. Today, shoot clusters with many nodes often need a considerable amount of time to perform rolling updates, and we see serial eviction as one of the root causes that can be improved.

@timuthy timuthy added the kind/enhancement Enhancement, improvement, extension label Sep 8, 2023
@gardener-robot gardener-robot added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related priority/3 Priority (lower number equals higher priority) labels Sep 8, 2023
@timuthy
Member Author

timuthy commented Nov 2, 2023

Any opinion @gardener/mcm-maintainers?

@elankath
Contributor

elankath commented Nov 2, 2023

Hi Tim, we can support this, though I am doubtful whether we should make it configurable in the shoot YAML: if an operator configures a high value, it can lead to severe degradation and then a fair amount of effort diagnosing and troubleshooting such issues.

However, I think we can introduce a fixed degree of parallelism in evicting Pods with PVs after relevant testing of the behaviour on problematic providers like Azure. Now that we have implemented #781, we wait for all volumes to be detached from the Node before proceeding to VM deletion. Hence the edge cases where still-attached volumes cause the attach/detach controller to run into timeouts are ameliorated.

@timuthy
Member Author

timuthy commented Nov 2, 2023

Thanks for the feedback @elankath. It wasn't meant to be an option for shoot owners. The degree of parallelism can also be configured by Gardenlet via its config.

@elankath
Contributor

elankath commented Nov 2, 2023

That's fine. A "hidden knob" like a CLI option parallel-eviction-limit=X should be OK.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Sep 13, 2024