
Track more liveness in "autoscaling stuck" #770

Closed
sharnoff opened this issue Jan 29, 2024 · 4 comments
Labels
a/reliability Area: relates to reliability of the service c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent

Comments

sharnoff (Member)

Problem description / Motivation

Currently, "autoscaling stuck" metrics and logs use the following definition:

An autoscaling-enabled VM is "stuck" if there has not been a successful health check response for the last 20s.

Current implementation is here.

This misses various other ways that autoscaling may currently be failing for a particular VM, some of which we've seen in prod (e.g. because it missed a pod start event, the scheduler doesn't know about a particular VM and always returns 404 to the autoscaler-agent).
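For reference, a minimal sketch of that time-based check (hypothetical names and structure; not the actual implementation linked above):

```go
package agent

import "time"

// stuckThreshold mirrors the 20s window described above (hypothetical name).
const stuckThreshold = 20 * time.Second

// vmLiveness tracks the last successful vm-monitor health check for one VM.
type vmLiveness struct {
	lastSuccessfulHealthCheck time.Time
}

// isStuck reports whether the VM has gone too long without a successful health check.
func (l *vmLiveness) isStuck(now time.Time) bool {
	return now.Sub(l.lastSuccessfulHealthCheck) > stuckThreshold
}
```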

Feature idea(s) / DoD

Some other types of "stuckness" we should consider including (see the metrics sketch below):

  • Requests to the scheduler plugin are failing
  • Requests to update the VM object are failing
  • Other requests to the vm-monitor are failing

And possibly also (although more difficult):

  • Scheduler plugin consistently denying desired upscaling
  • vm-monitor consistently denying desired downscaling (currently sometimes expected in practice)

See also #594. Tracking the delay between VM object change and VM status change is currently blocked on #592.
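One way these per-VM failure modes could be surfaced as metrics, assuming prometheus/client_golang (all metric, label, and function names here are hypothetical):

```go
package agent

import "github.com/prometheus/client_golang/prometheus"

// vmStuckReasons is a hypothetical per-VM gauge: 1 if autoscaling for the VM
// currently appears stuck for the given reason, 0 otherwise.
var vmStuckReasons = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "autoscaling_vm_stuck",
		Help: "Whether autoscaling for a VM appears stuck, labeled by reason",
	},
	// reason: e.g. "health_check", "scheduler_requests", "neonvm_requests", "monitor_requests"
	[]string{"vm", "reason"},
)

func init() {
	prometheus.MustRegister(vmStuckReasons)
}

// setStuck records whether autoscaling for the named VM is stuck for a reason.
func setStuck(vmName, reason string, stuck bool) {
	value := 0.0
	if stuck {
		value = 1.0
	}
	vmStuckReasons.WithLabelValues(vmName, reason).Set(value)
}
```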

sharnoff added the a/reliability and c/autoscaling/autoscaler-agent labels Jan 29, 2024
shayanh commented Feb 22, 2024

The mentioned types of VM stuckness are covered by other metrics.

I understand this is more fine-grained, at a per-VM granularity, but there is some duplication, particularly if we are looking to set up alerts based on this. @sharnoff wdyt?

shayanh assigned shayanh and unassigned Omrigan Feb 22, 2024
shayanh commented Feb 22, 2024

@sharnoff and I discussed this. It's partially covered by the other metrics, but having this makes it much easier to track progress with each VM.

Omrigan assigned Omrigan and unassigned shayanh Apr 6, 2024
Omrigan added a commit that referenced this issue Apr 6, 2024
Now a VM is considered stuck if:
- Health checks to vm-monitor have been failing for the configured period of time
- There are `n` failed or denied requests to the scheduler over the last `t` seconds
- There are `n` failed or denied requests to the vm-monitor over the last `t` seconds
- There are `n` failed requests to NeonVM over the last `t` seconds

I have defined the values `n` and `t` for each component in the config map. Comment if you think the chosen values could be better.

Part of #770

---------

Signed-off-by: Oleg Vasilev <[email protected]>
Co-authored-by: Oleg Vasilev <[email protected]>
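A rough sketch of the `n`-failures-over-`t`-seconds idea from the commit message above (types and names are hypothetical, not the actual implementation):

```go
package agent

import "time"

// failureWindow counts recent failed/denied requests to one component
// (scheduler plugin, vm-monitor, or NeonVM) and reports "stuck" once the
// count over the last `period` reaches `threshold`. Names are illustrative.
type failureWindow struct {
	threshold int           // the `n` from the commit message
	period    time.Duration // the `t` from the commit message
	failures  []time.Time   // timestamps of recent failures or denials
}

// recordFailure notes one failed or denied request at time now.
func (w *failureWindow) recordFailure(now time.Time) {
	w.failures = append(w.failures, now)
}

// stuck prunes entries older than `period` and checks whether the
// remaining failure count meets the threshold.
func (w *failureWindow) stuck(now time.Time) bool {
	cutoff := now.Add(-w.period)
	kept := w.failures[:0]
	for _, ts := range w.failures {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	w.failures = kept
	return len(w.failures) >= w.threshold
}
```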
Omrigan (Contributor) commented Apr 24, 2024

Status: after the deployment to prod yesterday, there is a non-zero number of stuck VMs on the dashboard.

For those VMs, vm-monitor consistently rejects downscaling requests, and they are now considered stuck.
The question is: do we want to investigate why the autoscaler-agent wants to downscale VMs while vm-monitor rejects the requests, or do we consider this situation normal and remove it from the definition of stuckness?

Thread: https://neondb.slack.com/archives/C03F5SM1N02/p1713893262823009

stradig (Contributor) commented May 6, 2024

Implementation finished; some follow-up in #926

stradig closed this as completed May 6, 2024
Omrigan added a commit that referenced this issue May 6, 2024
Fixes #926, a follow-up to #770.

The definition of VM stuckness was changed to include denied
downscale requests in the following commit:

    commit fdf0133
    Author: Shayan Hosseini <[email protected]>
    Date:   Sat Apr 6 09:25:01 2024 -0400

    agent: track more liveness in vm-stuck metrics (#855)

This resulted in the consistent firing of the alert.

We should actually treat denied downscales as part of normal
operation.

Signed-off-by: Oleg Vasilev <[email protected]>
Omrigan added a commit that referenced this issue May 7, 2024
Fixes #926, a follow-up to #770.

The definition of VM stuckness was changed to include denied
downscale requests in the following commit:

    commit fdf0133
    Author: Shayan Hosseini <[email protected]>
    Date:   Sat Apr 6 09:25:01 2024 -0400

    agent: track more liveness in vm-stuck metrics (#855)

This resulted in the consistent firing of the alert.

We should actually treat denied downscales as part of normal
operation. This can happen due to mismatched policies between the
autoscaler-agent and vm_monitor about what is an acceptable level of
memory usage.

Signed-off-by: Oleg Vasilev <[email protected]>