Allow for synchronous operation #68
Comments
There should be an option to run in "standalone mode", or the non-"standalone" mode should only be activated when an option is passed. The Helm chart should only grant the permissions needed for the Kubernetes Lease object in "multi replica mode". Edit: since a unique id would have to be passed in for leader election anyway, that argument could be used to activate the mode.
I've just found a similar problem on Py-Kube-Downscaler. Basically, the problem you are describing can happen not only when running more than one replica, but also when a downscaler iteration takes too long (other entities inside the cluster, like an HPA, can modify objects in the meantime). I still need to check whether the code we have here has the same problem; I think the concurrency that is already implemented helps mitigate it.
The problem could be present here as well, since the resource is retrieved before being updated: https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L113 https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L128 To solve it we should catch the error, fetch the resource again, and then update the refreshed resource.
Yes, this is exactly what causes the problem. If we want the error to happen even less often, we could also refetch the resource right before it is modified in the scan, which would reduce the chance of this happening (although, as you said, the concurrency already reduces the time between getting the resource and modifying it, and this would probably 10x the number of API calls).
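A minimal sketch of the refetch-and-retry idea discussed above, using client-go's retry.RetryOnConflict helper. The function name scaleDeployment and the typed Deployment client are illustrative assumptions, not the downscaler's actual code, which works through its own resource abstraction:

```go
package scaling

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// scaleDeployment (illustrative helper) refetches the Deployment on every
// attempt so each Update is based on the latest resourceVersion, and retries
// only when the apiserver reports a conflict.
func scaleDeployment(ctx context.Context, client kubernetes.Interface, namespace, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Fetch a fresh copy right before modifying it; this shrinks the window
		// in which another actor (another replica, an HPA, a user) can change it.
		deploy, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		deploy.Spec.Replicas = &replicas
		_, err = client.AppsV1().Deployments(namespace).Update(ctx, deploy, metav1.UpdateOptions{})
		return err // RetryOnConflict retries only if this is a 409 Conflict
	})
}
```

This keeps the extra API calls limited to the conflict case instead of refetching every resource on every scan.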
In the case of multiple replicas, this would be the best approach:
In the case of long iterations: we could let the user choose how to mitigate this behavior, for example by introducing an argument like
Issue
Currently the downscaler doesn't support running multiple replicas/instances on the same workloads.
Problem to solve
If a workload is modified by another instance after an instance has already fetched that workload from Kubernetes, the downscaler will throw an error because the resourceVersion changed.
The same error occurs if the workload is modified manually or by something else during scaling.
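For reference, the error in question is the apiserver's optimistic-concurrency conflict (HTTP 409, "the object has been modified; please apply your changes to the latest version and try again"). A sketch of detecting it with client-go so it can be treated as a retry-next-cycle condition rather than a failure; the helper name handleUpdateErr and the use of slog are assumptions for illustration:

```go
package scaling

import (
	"log/slog"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// handleUpdateErr (sketch) treats a resourceVersion conflict as a non-fatal
// "retry in the next scan cycle" condition instead of a hard error.
func handleUpdateErr(err error) error {
	// The apiserver rejects updates built from a stale resourceVersion with
	// HTTP 409 Conflict; apierrors.IsConflict recognizes exactly that case.
	if apierrors.IsConflict(err) {
		slog.Info("workload was modified concurrently, retrying in the next scan cycle", "error", err)
		return nil
	}
	return err
}
```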
Further details
Proposal
I don't think it is possible to handle both multiple instances and manual interference perfectly, at least not without occasionally erroring or logging in some way.
Because of this we should change the handling of this error to give more information to the user (possibly linking to documentation on the topic).
The error could also be downgraded to an info message, or its logging could be turned off via a CLI argument, since nothing is actually broken and the downscaler will simply retry in the next scan cycle.
If we want the downscaler to allow for multiple instances, we need to implement some way to make this error less likely or prevent it from occurring at all when multiple instances are running.
One way to implement this would be Kubernetes leader election using Leases. Only one downscaler would scale at a time, with the other instances acting as redundant standbys, which would entirely stop the error from occurring because of multiple downscaler instances (a sketch follows after this section).
Combining both of these would result in the error only being thrown on outside/manual intervention, making it much less likely to occur.
Additionally, when it does occur, the user doesn't need to search for help with this intermittent error because the message explains exactly what happened.
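A sketch of the leader-election approach, closely following the client-go leader-election example linked in the references below. The lease name, namespace, timings, and the runScans callback are placeholder assumptions:

```go
package scaling

import (
	"context"
	"log/slog"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection (sketch): only the replica currently holding the
// Lease runs the scan loop; the other replicas stand by until they acquire it.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string, runScans func(ctx context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "gokubedownscaler-leader", // hypothetical lease name
			Namespace: "kube-downscaler",         // hypothetical namespace
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: id, // unique per replica, e.g. the pod name
		},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second, // placeholder timings
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// only the leader scans and scales workloads
				runScans(ctx)
			},
			OnStoppedLeading: func() {
				slog.Info("lost leadership, standing by", "id", id)
			},
			OnNewLeader: func(current string) {
				if current != id {
					slog.Info("another replica is the leader", "leader", current)
				}
			},
		},
	})
}
```

The required unique identity is also what the earlier comment suggests reusing as the switch that activates "multi replica mode", so standalone deployments would not need the Lease RBAC permissions at all.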
Who can address the issue
Go dev with some Kubernetes know-how
Other links/references
./docs/troubleshooting.md#synchronous-operation
https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go#L112
https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload