
Track more liveness in "autoscaling stuck" #770

Closed
sharnoff opened this issue Jan 29, 2024 · 4 comments
Labels
a/reliability Area: relates to reliability of the service c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent

Comments

sharnoff (Member)

Problem description / Motivation

Currently, "autoscaling stuck" metrics and logs use the following definition:

An autoscaling-enabled VM is "stuck" if there has not been a successful health check response for the last 20s.

Current implementation is here.

This misses various other ways that autoscaling may currently be failing for a particular VM, some of which we've seen in prod (e.g. because it missed a pod start event, the scheduler doesn't know about a particular VM and always returns 404 to the autoscaler-agent).
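For reference, a minimal sketch of that time-based check (hypothetical names and structure; not the actual implementation linked above):

```go
package agent

import "time"

// stuckThreshold mirrors the 20s window described above (hypothetical name).
const stuckThreshold = 20 * time.Second

// vmLiveness tracks the last successful vm-monitor health check for one VM.
type vmLiveness struct {
	lastSuccessfulHealthCheck time.Time
}

// isStuck reports whether the VM has gone too long without a successful health check.
func (l *vmLiveness) isStuck(now time.Time) bool {
	return now.Sub(l.lastSuccessfulHealthCheck) > stuckThreshold
}
```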

Feature idea(s) / DoD

Some other types of "stuckness" we should consider including (see the metrics sketch below):

  • Requests to the scheduler plugin are failing
  • Requests to update the VM object are failing
  • Other requests to the vm-monitor are failing

And possibly also (although more difficult):

  • Scheduler plugin consistently denying desired upscaling
  • vm-monitor consistently denying desired downscaling (currently sometimes expected in practice)

See also #594. Tracking the delay between VM object change and VM status change is currently blocked on #592.
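One way these per-VM failure modes could be surfaced as metrics, assuming prometheus/client_golang (all metric, label, and function names here are hypothetical):

```go
package agent

import "github.com/prometheus/client_golang/prometheus"

// vmStuckReasons is a hypothetical per-VM gauge: 1 if autoscaling for the VM
// currently appears stuck for the given reason, 0 otherwise.
var vmStuckReasons = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "autoscaling_vm_stuck",
		Help: "Whether autoscaling for a VM appears stuck, labeled by reason",
	},
	// reason: e.g. "health_check", "scheduler_requests", "neonvm_requests", "monitor_requests"
	[]string{"vm", "reason"},
)

func init() {
	prometheus.MustRegister(vmStuckReasons)
}

// setStuck records whether autoscaling for the named VM is stuck for a reason.
func setStuck(vmName, reason string, stuck bool) {
	value := 0.0
	if stuck {
		value = 1.0
	}
	vmStuckReasons.WithLabelValues(vmName, reason).Set(value)
}
```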

sharnoff added the a/reliability and c/autoscaling/autoscaler-agent labels Jan 29, 2024
shayanh commented Feb 22, 2024

The mentioned types of VM stuckness are covered by other metrics.

I understand this is more fine-grained, at a per-VM granularity, but there is some duplication, particularly if we are looking to set up alerts based on this. @sharnoff wdyt?

shayanh assigned shayanh and unassigned Omrigan Feb 22, 2024
shayanh commented Feb 22, 2024

@sharnoff and I discussed this. It's partially covered by the other metrics, but having this makes it much easier to track progress with each VM.

Omrigan assigned Omrigan and unassigned shayanh Apr 6, 2024
Omrigan added a commit that referenced this issue Apr 6, 2024
Now a VM is considered stuck if:
- Health checks to vm-monitor have been failing for the configured period of time
- There are `n` failed or denied requests to the scheduler over the last `t` seconds
- There are `n` failed or denied requests to the vm-monitor over the last `t` seconds
- There are `n` failed requests to NeonVM over the last `t` seconds

I have defined the values `n` and `t` for each component in the config map. Comment if you think the chosen values could be better.

Part of #770

---------

Signed-off-by: Oleg Vasilev <[email protected]>
Co-authored-by: Oleg Vasilev <[email protected]>
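A rough sketch of the `n`-failures-over-`t`-seconds idea from the commit message above (types and names are hypothetical, not the actual implementation):

```go
package agent

import "time"

// failureWindow counts recent failed/denied requests to one component
// (scheduler plugin, vm-monitor, or NeonVM) and reports "stuck" once the
// count over the last `period` reaches `threshold`. Names are illustrative.
type failureWindow struct {
	threshold int           // the `n` from the commit message
	period    time.Duration // the `t` from the commit message
	failures  []time.Time   // timestamps of recent failures or denials
}

// recordFailure notes one failed or denied request at time now.
func (w *failureWindow) recordFailure(now time.Time) {
	w.failures = append(w.failures, now)
}

// stuck prunes entries older than `period` and checks whether the
// remaining failure count meets the threshold.
func (w *failureWindow) stuck(now time.Time) bool {
	cutoff := now.Add(-w.period)
	kept := w.failures[:0]
	for _, ts := range w.failures {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	w.failures = kept
	return len(w.failures) >= w.threshold
}
```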
Omrigan (Contributor) commented Apr 24, 2024

Status: after the deployment to prod yesterday, there is a non-zero number of stuck VMs on the dashboard.

For those VMs, vm-monitor consistently rejects downscaling requests, and they are now considered stuck.
The question is: do we want to investigate why the autoscaler-agent wants to downscale VMs while vm-monitor rejects the requests, or do we consider this situation normal and remove it from the definition of stuckness?

Thread: https://neondb.slack.com/archives/C03F5SM1N02/p1713893262823009

stradig (Contributor) commented May 6, 2024

Implementation finished; some follow-up in #926

stradig closed this as completed May 6, 2024
Omrigan added a commit that referenced this issue May 6, 2024
Fixes #926, a follow-up to #770.

The definition of VM stuckness was changed to include denied
downscale requests in the following commit:

    commit fdf0133
    Author: Shayan Hosseini <[email protected]>
    Date:   Sat Apr 6 09:25:01 2024 -0400

    agent: track more liveness in vm-stuck metrics (#855)

This resulted in the consistent firing of the alert.

We should actually treat denied downscales as part of normal
operation.

Signed-off-by: Oleg Vasilev <[email protected]>
Omrigan added a commit that referenced this issue May 7, 2024
Fixes #926, a follow-up to #770.

The definition of VM stuckness was changed to include denied
downscale requests in the following commit:

    commit fdf0133
    Author: Shayan Hosseini <[email protected]>
    Date:   Sat Apr 6 09:25:01 2024 -0400

    agent: track more liveness in vm-stuck metrics (#855)

This resulted in the consistent firing of the alert.

We should actually treat denied downscales as part of normal
operation. This can happen due to mismatched policies between the
autoscaler-agent and vm_monitor about what is an acceptable level of
memory usage.

Signed-off-by: Oleg Vasilev <[email protected]>