Track more liveness in "autoscaling stuck" #770
Comments
The mentioned types of VM stuckness are covered by other metrics:
I understand this is more fine-grained, at a per-VM granularity, but there is some duplication, particularly if we are looking to set up alerts based on this. @sharnoff wdyt?
@sharnoff and I discussed this. It's partially covered by the other metrics, but having this makes it much easier to track the progress of each VM.
Now a VM is considered stuck if:
- Health checks to the vm-monitor are failing for the configured period of time
- There are `n` failed or denied requests to the scheduler over the last `t` seconds
- There are `n` failed or denied requests to the vm-monitor over the last `t` seconds
- There are `n` failed requests to NeonVM over the last `t` seconds

I have defined the values of `n` and `t` for each component in the config map. Comment if you think the chosen values could be better.

Part of #770

Signed-off-by: Oleg Vasilev <[email protected]>
Co-authored-by: Oleg Vasilev <[email protected]>
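To make the windowed check concrete, here is a minimal, self-contained Go sketch of the "`n` failures over the last `t` seconds" predicate described above. The `failureWindow` type and its names are hypothetical illustrations, not the actual autoscaler-agent code:

```go
package main

import (
	"fmt"
	"time"
)

// failureWindow tracks failure timestamps so we can ask: were there at
// least `threshold` (n) failures within the trailing `period` (t)?
type failureWindow struct {
	period    time.Duration // t: how far back to look
	threshold int           // n: failures required to count as "stuck"
	failures  []time.Time
}

func (w *failureWindow) recordFailure(now time.Time) {
	w.failures = append(w.failures, now)
}

// stuck prunes entries older than the window, then compares what
// remains against the threshold.
func (w *failureWindow) stuck(now time.Time) bool {
	cutoff := now.Add(-w.period)
	kept := w.failures[:0]
	for _, ts := range w.failures {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	w.failures = kept
	return len(w.failures) >= w.threshold
}

func main() {
	// e.g. n = 3 failed scheduler requests over the last 60 seconds
	w := failureWindow{period: 60 * time.Second, threshold: 3}
	base := time.Now()
	for i := 0; i < 3; i++ {
		w.recordFailure(base.Add(time.Duration(i) * time.Second))
	}
	fmt.Println(w.stuck(base.Add(5 * time.Second))) // true
}
```

One sliding window per component (scheduler, vm-monitor, NeonVM) with per-component `n` and `t` from the config map would reproduce the behavior the PR describes.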
Status: after yesterday's deployment to prod, there is a non-zero number of stuck VMs on the dashboard. For those VMs, the vm-monitor consistently rejects downscaling requests, so they are now considered stuck. Thread: https://neondb.slack.com/archives/C03F5SM1N02/p1713893262823009
Implementation finished; some follow-up in #926.
Fixes #926, a follow-up to #770. The definition of VM stuckness was changed to include denied downscale requests in the following commit:

commit fdf0133
Author: Shayan Hosseini <[email protected]>
Date: Sat Apr 6 09:25:01 2024 -0400
agent: track more liveness in vm-stuck metrics (#855)

This resulted in the alert firing consistently. We should actually treat a denied downscale as part of normal operation: it can happen when the autoscaler-agent and the vm-monitor have mismatched policies on what an acceptable level of memory usage is.

Signed-off-by: Oleg Vasilev <[email protected]>
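To illustrate the fix, here is a hedged extension of the `failureWindow` sketch above: a denied downscale is treated as ordinary back-pressure from the vm-monitor, and only outright request failures feed the stuck-VM window. The outcome type and names are hypothetical, not the actual agent API:

```go
// requestOutcome classifies the result of a request to the vm-monitor.
type requestOutcome int

const (
	outcomeOK     requestOutcome = iota
	outcomeDenied                // vm-monitor said "no": expected, not stuckness
	outcomeFailed                // transport/protocol error: counts toward stuckness
)

// observe feeds only genuine failures into the window; denied downscales
// are deliberately ignored, per the #926 fix.
func (w *failureWindow) observe(outcome requestOutcome, now time.Time) {
	if outcome == outcomeFailed {
		w.recordFailure(now)
	}
}
```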
Problem description / Motivation
Currently, "autoscaling stuck" metrics and logs use the following definition:
Current implementation is here.
This misses various other ways that autoscaling may currently be failing for a particular VM, some of which we've seen in prod (e.g. because a pod start event was missed, the scheduler doesn't know about a particular VM and always returns 404 to the autoscaler-agent).
Feature idea(s) / DoD
Some other types of "stuckness" we should look at including:
- Health checks to the vm-monitor failing for an extended period
- Requests to the scheduler repeatedly failing or being denied
- Requests to the vm-monitor repeatedly failing or being denied
- Requests to NeonVM repeatedly failing
And possibly also (although more difficult):
- Tracking the delay between a change to the VM object and the corresponding change in VM status
See also #594. Tracking the delay between VM object change and VM status change is currently blocked on #592.