Allow for synchronous operation #68

Open
jonathan-mayer opened this issue Nov 5, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@jonathan-mayer
Member

jonathan-mayer commented Nov 5, 2024

Issue

Currently the downscaler doesn't support running multiple replicas/instances on the same workloads.

Problem to solve

If a workload is modified by another instance after the current instance has already fetched it from Kubernetes, the downscaler throws an error because the resourceVersion has changed.
The same error occurs if the workload is modified manually or by something else during scaling.

Further details

Proposal

I don't think it is possible to handle both multiple instances and manual interference perfectly, at least not without erroring or logging in some form.
Because of this we should change the handling of this error to give more information to the user (possibly with a link to documentation on the topic).
The error could also be downgraded to an info message, or its logging could be turned off via a CLI argument, since nothing is actually broken and the downscaler will simply retry in the next scan cycle (see the sketch below).
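
A minimal sketch of what the downgraded logging could look like, using client-go's apierrors.IsConflict to detect exactly this resourceVersion error. The handleScaleError helper, the log wording, and the docs link placement are assumptions for illustration, not existing code:

```go
package scaling

import (
    "log/slog"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// handleScaleError downgrades resourceVersion conflicts (HTTP 409) to an
// info log, since the next scan cycle will pick the workload up again.
func handleScaleError(err error, workload string) error {
    if apierrors.IsConflict(err) {
        slog.Info("workload was modified while scaling, will retry on the next scan",
            "workload", workload,
            "docs", "./docs/troubleshooting.md#synchronous-operation")
        return nil
    }
    return err // everything else is still a real error
}
```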

If we want the downscaler to allow for multiple instances, we need some way to make this error less likely or prevent it from occurring when multiple instances of the downscaler are running.
One way to implement this would be Kubernetes' leader election using Leases. Only one downscaler would scale at a time, the other instances would simply provide redundancy, and the error could no longer be caused by multiple downscaler instances at all. A minimal sketch of this is shown below.
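
For reference, a rough sketch of leader election with a Lease lock via client-go's leaderelection package, close to the example linked below. The lease name, namespace, timings, and the runScanLoop callback are assumptions, not code that exists in this repo:

```go
package election

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection only calls runScanLoop while this replica holds the
// Lease, so no two downscaler instances ever scale the same workloads.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, runScanLoop func(context.Context)) {
    id, _ := os.Hostname() // the pod name works as a unique identity

    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "downscaler-lease",
            Namespace: "kube-downscaler",
        },
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        ReleaseOnCancel: true,
        LeaseDuration:   15 * time.Second,
        RenewDeadline:   10 * time.Second,
        RetryPeriod:     2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: runScanLoop,           // only the leader scans
            OnStoppedLeading: func() { os.Exit(1) }, // lost the lease, let the pod restart
        },
    })
}
```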

Combining both of these would mean the error is only thrown on outside/manual intervention, making it much less likely to occur.
Additionally, when it does occur, the user doesn't need to search for help with an intermittent error because the message explains exactly what happened.

Who can address the issue

Go dev with some Kubernetes know-how

Other links/references

./docs/troubleshooting.md#synchronous-operation
https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go#L112
https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload

@jonathan-mayer jonathan-mayer added the enhancement (New feature or request) and low priority (An issue which can wait and doesn't need to be resolved quickly) labels and removed the low priority label Nov 5, 2024
@jonathan-mayer jonathan-mayer changed the title Allow for synchronous operation/ Allow for synchronous operation Nov 5, 2024
@jonathan-mayer
Member Author

jonathan-mayer commented Nov 5, 2024

There should be an option to run in "standalone mode", or the non-"standalone mode" should only be activated when an option is passed. The Helm chart should only grant the permissions needed for the Kubernetes Lease object in "multi replica mode".

Edit: since a unique id would have to be passed in for leader election anyway, that argument could be used to "activate" the mode (see the sketch below).
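
A rough sketch of that idea, assuming a hypothetical --lease-id argument; the two helper functions are placeholders for the existing scan loop and the leader-election wrapper sketched above:

```go
package main

import (
    "context"
    "flag"
)

// placeholder for the existing scan/scale loop
func runScanLoop(ctx context.Context) {}

// placeholder for a leader-election wrapper using the given Lease identity
func runWithLeaderElection(ctx context.Context, id string, run func(context.Context)) { run(ctx) }

func main() {
    leaseID := flag.String("lease-id", "", "unique leader election identity; enables multi replica mode when set")
    flag.Parse()

    ctx := context.Background()
    if *leaseID == "" {
        runScanLoop(ctx) // standalone mode: no Lease permissions needed
        return
    }
    runWithLeaderElection(ctx, *leaseID, runScanLoop) // multi replica mode
}
```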

@samuel-esp
Contributor

samuel-esp commented Nov 14, 2024

I've just found a similar problem in Py-Kube-Downscaler. Basically, the problem you are describing can happen not only when running more than one replica, but also when a downscaler iteration takes too long (other entities inside the cluster, like an HPA, could modify objects in the meantime). I still need to check whether the code we have here has the same problem; I think the concurrency already implemented helps mitigate it.

caas-team/py-kube-downscaler#111

@samuel-esp
Contributor

The problem could be present, since the resource is retrieved before being updated

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L113

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L128

To solve this problem we should catch the error, fetch the resource again, and then update the refreshed resource (a sketch is below)
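
A sketch of that catch-the-error, refetch, update flow using client-go's retry.RetryOnConflict helper; the Deployment type, namespace, and name parameters are just for illustrating the pattern, not how this project's scaling path is structured:

```go
package scaling

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/util/retry"
)

// scaleDeployment retries the get+update whenever the update hits a
// resourceVersion conflict, always working on a freshly fetched object.
func scaleDeployment(ctx context.Context, client kubernetes.Interface, namespace, name string, replicas int32) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        deploy, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        deploy.Spec.Replicas = &replicas
        _, err = client.AppsV1().Deployments(namespace).Update(ctx, deploy, metav1.UpdateOptions{})
        return err // a conflict here triggers another attempt with a fresh object
    })
}
```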

@jonathan-mayer
Member Author

jonathan-mayer commented Nov 15, 2024

The problem could be present, since the resource is retrieved before being updated

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L113

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L128

To solve this problem we should catch the error, fetch the resource again, and then update the refreshed resource

Yes, this is exactly what causes the problem.
My only reservation is that it isn't worth refetching the resource in a loop until the downscaler manages to update it without it being modified in between, when the downscaler runs every 30 seconds anyway.
I think as long as we can stop the error from occurring because of multiple replicas, and the error gets logged better than it is now, it should be uncommon enough not to be an issue.

If we want the error to happen even less often, we could also just refetch the resource right before it is modified in the scan, which would reduce the chance of this happening (although, like you said, the concurrency already reduces the time between getting the resource and modifying it, and this would probably 10x the number of API calls).

@samuel-esp
Contributor

In the case of multiple replicas, this would be the best approach:

One way to implement this would be Kubernetes' leader election using Leases. Only one downscaler would scale at a time, the other instances would simply provide redundancy, and the error could no longer be caused by multiple downscaler instances at all.

In the case of long iterations:

We could let the user choose how to mitigate this behavior, for example by introducing an argument like --retry-on-conflict (int) where the user sets the number of retries to perform (0 could be the default). This way the user still has control over the number of API calls. I'll try implementing something like this in Py-Kube-Downscaler first. A sketch of the idea is below.
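
A rough sketch of how such a --retry-on-conflict value could be wired into the retry backoff; the flag name follows the proposal above, and nothing here exists in either project yet:

```go
package main

import (
    "flag"

    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/util/retry"
)

func main() {
    retries := flag.Int("retry-on-conflict", 0, "extra attempts after a resourceVersion conflict (0 = current behavior)")
    flag.Parse()

    backoff := wait.Backoff{
        Steps:    *retries + 1, // 0 retries means exactly one attempt
        Duration: retry.DefaultRetry.Duration,
        Factor:   retry.DefaultRetry.Factor,
        Jitter:   retry.DefaultRetry.Jitter,
    }

    _ = retry.RetryOnConflict(backoff, func() error {
        // get the workload, apply the scaling change, update it;
        // returning a conflict error consumes one of the configured attempts
        return nil
    })
}
```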
