autoscaler-agent "stuck" metric should use a pair of counters #789

Open
sharnoff opened this issue Feb 5, 2024 · 0 comments
Labels
c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent t/feature Issue type: feature, for new features or requests

Comments

sharnoff commented Feb 5, 2024

Problem description / Motivation

Currently we use a single gauge for the "autoscaling stuck" metric exposed by the autoscaler-agent.

This is nice because it's simple.

However, a downside of this is that we only know how many VMs were stuck at a particular moment in time — we don't know, for example, how many VMs became stuck between two points in time.

Knowing the number of distinct VMs that became stuck between two points in time would require either (a) looking at the logs, or (b) high-cardinality metrics. But if we're ok having duplicate entries for the same VM, we can just look at the total number of times any VM became stuck, which can be represented by a counter.

Feature idea(s) / DoD

DoD is that we need some way to get the number of VMs that became stuck between two timestamps, rather than just the number that are currently stuck — without this, we're significantly under-counting the rate/quantity of stuck VMs.

Implementation ideas

Continuing from the motivation, if we also have a counter for the number of times any VM became un-stuck, we can subtract it from the number of times VMs have become stuck to get the current number of stuck VMs, replacing / augmenting the current metric.
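As a rough sketch of the arithmetic (not the agent's actual code): the type and function names below are hypothetical, and the two counters are modelled as plain integers rather than being registered with a real metrics library.

```go
package main

import "fmt"

// stuckCounters is a hypothetical stand-in for a pair of monotonically
// increasing counters. Neither field ever decreases.
type stuckCounters struct {
	becameStuck   uint64 // total number of times any VM became stuck
	becameUnstuck uint64 // total number of times any VM became un-stuck
}

func (c *stuckCounters) markStuck()   { c.becameStuck++ }
func (c *stuckCounters) markUnstuck() { c.becameUnstuck++ }

// currentlyStuck recovers what the single gauge reports today.
func (c stuckCounters) currentlyStuck() uint64 {
	return c.becameStuck - c.becameUnstuck
}

// stuckBetween returns how many times a VM became stuck between two
// snapshots. The same VM is counted once per time it became stuck.
func stuckBetween(before, after stuckCounters) uint64 {
	return after.becameStuck - before.becameStuck
}

func main() {
	var c stuckCounters
	c.markStuck() // VM A becomes stuck
	t0 := c       // snapshot at the first timestamp

	c.markUnstuck() // VM A recovers
	c.markStuck()   // VM B becomes stuck
	t1 := c         // snapshot at the second timestamp

	fmt.Println(t0.currentlyStuck(), t1.currentlyStuck()) // gauge view: 1 1
	fmt.Println(stuckBetween(t0, t1))                     // counter view: 1
}
```

The gauge view is identical at both timestamps, while the difference in the "became stuck" counter shows that one VM became stuck in between.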

Alternatively, because stuckness is currently represented by the autoscaling_agent_runners_current{state=...} metric, we could introduce a "runner state transitions" metric, where new_state="stuck" means the VM has become stuck, and old_state="stuck" means the VM has become unstuck.

This would similarly allow us to unify the handling for panicked/errored runners (instead of having separate autoscaling_agent_runner_fatal_errors_total / autoscaling_agent_runner_thread_panics_total)
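A minimal sketch of how such a transitions metric could be queried, assuming one counter per (old_state, new_state) pair. The map-based type, the "ok" and "errored" state names, and the empty initial state are all assumptions for illustration, not the agent's real states or metric names.

```go
package main

import "fmt"

// transition is a hypothetical label pair for a "runner state transitions"
// counter: one time series per (old_state, new_state) combination.
type transition struct{ oldState, newState string }

// transitionCounts models the counter family as a plain map.
type transitionCounts map[transition]uint64

func (t transitionCounts) record(oldState, newState string) {
	t[transition{oldState, newState}]++
}

// entered sums every transition into the given state.
func (t transitionCounts) entered(state string) uint64 {
	var n uint64
	for k, v := range t {
		if k.newState == state {
			n += v
		}
	}
	return n
}

// left sums every transition out of the given state.
func (t transitionCounts) left(state string) uint64 {
	var n uint64
	for k, v := range t {
		if k.oldState == state {
			n += v
		}
	}
	return n
}

// current derives a per-state gauge from the counters.
func (t transitionCounts) current(state string) uint64 {
	return t.entered(state) - t.left(state)
}

func main() {
	counts := transitionCounts{}
	counts.record("", "ok") // assumed: runner starts with no previous state
	counts.record("ok", "stuck")
	counts.record("stuck", "ok")
	counts.record("ok", "stuck")
	counts.record("stuck", "errored") // assumed state name

	fmt.Println(counts.entered("stuck"))   // times a VM became stuck: 2
	fmt.Println(counts.current("stuck"))   // currently stuck: 0
	fmt.Println(counts.entered("errored")) // could stand in for a fatal-errors total: 1
}
```

With this shape, the "became stuck" total, the current per-state count, and the error/panic totals all come from the same counter family.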

Related issues

@sharnoff sharnoff added t/feature Issue type: feature, for new features or requests c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent labels Feb 5, 2024