Motivation
Two reasons:
- We're currently flying blind w.r.t. how long scaling takes.
- Scaling latency should be part of any autoscaling SLOs.
DoD
We should have histogram metrics recording:
- end-to-end latency of scaling (down and up; CPU and memory)
- latency of all the components:
  - requests to the scheduler plugin (including retries)
  - requests to vm-monitor (including retries)
  - delay between the initial NeonVM patch request and when the status was updated
Implementation ideas
AFAICT the basic idea is that we store some extra info in agent/core.State and add some extra callbacks in agent/core.Config to increment some metrics when we determine that various parts of scaling (and the entire thing) have occurred.
More design work is required, because the edge cases are quite subtle.
Tasks
Latency metrics measured by pkg/agent/core.State would have allowed us to notice the effects of #614 (i.e. core.State believed there was a super long-running request).