Epic: Scaling latency metrics #594

sharnoff · 2023-10-30T04:24:57Z

Motivation

Two reasons:

- We're currently flying blind w.r.t. how long scaling takes.
- Scaling latency should be part of any autoscaling SLOs.

DoD

We should have histogram metrics recording:

- end-to-end latency of scaling (down and up; cpu and memory)
- latency of each component:
  - requests to scheduler plugin (including retries)
  - requests to vm-monitor (including retries)
  - delay between the initial NeonVM patch request and when the status was updated

Implementation ideas

AFAICT the basic idea is that we store some extra info in agent/core.State and add some extra callbacks in agent/core.Config to record metrics when we determine that individual parts of scaling (and the scaling operation as a whole) have completed.

More design work is required, because the edge cases are quite subtle.

Tasks

Blockers


- Bug: VM memory scale-up leaves "Scaling" phase before updating status resources #453 (c/autoscaling/neonvm, t/bug)
- Bug: Fractional CPU scale-down temporarily exceeds spec maximum #462 (c/autoscaling/neonvm, t/bug)
- Bug: pkg/agent/state should use VM status, not spec to represent "current" resources #592 (1 of 2 tasks; c/autoscaling/autoscaler-agent, t/bug)

Implementation


- Decide how we should measure latency (difficult due to eventual consistency)
- agent/core: Measure scaling latency

Follow-ups


Other related tasks, Epics, and links

Proposed RFC


sharnoff · 2023-11-07T17:48:00Z

Latency metrics measured by pkg/agent/core.State would have allowed us to notice the effects of #614 (i.e. core.State believed there was a super long-running request).

sharnoff · 2024-06-10T15:13:33Z

Status: waiting on @sharnoff and @stradig to review the internal RFC.

sharnoff added the t/Epic and c/autoscaling/autoscaler-agent labels Oct 30, 2023

sharnoff mentioned this issue Dec 15, 2023

agent/core: Use VM spec as source of truth for current resources #350

Open

This was referenced Jan 24, 2024

Epic: VM startup latency metrics & SLOs #759

Closed

Track more liveness in "autoscaling stuck" #770

Closed

Omrigan self-assigned this Jun 4, 2024

Omrigan mentioned this issue Jun 20, 2024

Implement scaling latency metrics through revisions #983

Merged

Omrigan closed this as completed in #983 Jul 22, 2024

Omrigan closed this as completed in 4395a93 Jul 22, 2024
