Bug: autoscaler-agent scaling algorithm is too volatile for larger computes #729

sharnoff · 2024-01-08T03:42:49Z

Problem description / Motivation

One of the blockers for allowing larger computes (ref neondatabase/cloud#9103) is improving the scaling algorithm.

Currently, the scaling algorithm (a) recalculates the "goal" CU every 5s from updated metrics, and (b) does not take past metrics into account when calculating the "goal" CU. As a result:

- It's easy to cause the goal CU to oscillate, resulting in a lot of effort spent scaling, with little net benefit.
- As computes get larger, the same percentage change in metrics is more likely to produce a change in (integer) goal CU, meaning that each 5s metrics update is more likely to prompt scaling, and by a larger amount.
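
To make the second point concrete, here is a minimal sketch of how integer rounding amplifies the effect on larger computes. The formula is an assumption for illustration only (one CU per 1000 "milli-units" of load, rounded up); the agent's actual goal CU calculation is not shown in this issue, and `perCUMilli`, `goalCU` and `goalSwing` are hypothetical names.

```go
package main

import "fmt"

// perCUMilli is a hypothetical amount of load (in thousandths) that one CU
// is assumed to absorb. This is NOT the agent's real formula.
const perCUMilli = 1000

// goalCU rounds the load up to a whole number of CUs using integer math.
func goalCU(loadMilli int) int {
	return (loadMilli + perCUMilli - 1) / perCUMilli
}

// goalSwing returns how far the goal CU moves when the load fluctuates by
// plus or minus pct percent around loadMilli.
func goalSwing(loadMilli, pct int) int {
	lo := loadMilli * (100 - pct) / 100
	hi := loadMilli * (100 + pct) / 100
	return goalCU(hi) - goalCU(lo)
}

func main() {
	// The same +/-10% noise on a small, medium and large compute.
	for _, load := range []int{500, 2000, 20000} {
		fmt.Printf("load=%d swing=%d CU\n", load, goalSwing(load, 10))
	}
}
```

Under these assumptions, the same 10% noise leaves a half-CU load at the same goal, moves a 2 CU load by one CU, and moves a 20 CU load by four.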

In a perfect world, maybe this'd be fine. But in practice, the process of scaling actually consumes resources, and so is generally something we want to avoid doing frivolously.

See also: https://neondb.slack.com/archives/C03ETHV2KD1/p1704319422570509?thread_ts=1704316837.680979

Feature idea(s) / DoD

Scaling algorithm should be more stable over some time period, under some conditions.

This isn't a super well-defined goal — so this issue mostly just exists to track some improvement.

Implementation ideas

There are a couple of directions we could take this.

One is to still not include any scaling history, and instead limit the size of each change (e.g., to no more than 1 CU at a time) and introduce rate-limiting on scaling. This wouldn't necessarily stop oscillation, but may reduce the impact.
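
A rough sketch of what the first direction could look like, assuming a step cap plus a minimum interval between changes. `StepLimiter`, its fields, and the 15s interval are all hypothetical and not part of the autoscaler-agent.

```go
package main

import (
	"fmt"
	"time"
)

// StepLimiter is a hypothetical helper: it caps each change at MaxStep CUs
// and refuses to change again until MinInterval has passed.
type StepLimiter struct {
	MaxStep     int
	MinInterval time.Duration
	lastChange  time.Time
	changed     bool
}

// Next returns the CU to move to, given the current CU and the goal CU.
func (l *StepLimiter) Next(current, goal int, now time.Time) int {
	if goal == current {
		return current
	}
	// Rate limit: too soon since the last change, so stay put.
	if l.changed && now.Sub(l.lastChange) < l.MinInterval {
		return current
	}
	// Step limit: clamp the change to at most MaxStep in either direction.
	step := goal - current
	if step > l.MaxStep {
		step = l.MaxStep
	}
	if step < -l.MaxStep {
		step = -l.MaxStep
	}
	l.lastChange = now
	l.changed = true
	return current + step
}

// demo replays a constant goal of 8 CU every 5s, starting from 4 CU.
func demo() []int {
	l := &StepLimiter{MaxStep: 1, MinInterval: 15 * time.Second}
	start := time.Unix(0, 0)
	current := 4
	var history []int
	for tick := 0; tick < 7; tick++ {
		now := start.Add(time.Duration(tick) * 5 * time.Second)
		current = l.Next(current, 8, now)
		history = append(history, current)
	}
	return history
}

func main() {
	fmt.Println(demo())
}
```

As the issue notes, this does not remove oscillation by itself: an alternating goal would still produce alternating changes, only smaller and less frequent. It also slows down legitimate upscaling, which is visible in the demo taking 30s to move 3 CU.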

The other is to include some history around recent metrics so that we have a longer time period to use for decision-making. This would probably be harder to implement, but likely easier to understand and more likely to produce better outcomes.
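
As an illustration of the second direction, the sketch below compares a goal CU computed from only the latest sample against one computed from the mean of a short window of recent samples. This is a plain moving average chosen for simplicity, not necessarily the approach that was eventually implemented; the load values, window size, and rounding formula are assumptions.

```go
package main

import "fmt"

// perCUMilli is a hypothetical load-per-CU constant, used only for this sketch.
const perCUMilli = 1000

// goalCU rounds the load up to a whole number of CUs.
func goalCU(loadMilli int) int {
	return (loadMilli + perCUMilli - 1) / perCUMilli
}

// instantGoals computes the goal CU from each sample on its own.
func instantGoals(loads []int) []int {
	goals := make([]int, 0, len(loads))
	for _, v := range loads {
		goals = append(goals, goalCU(v))
	}
	return goals
}

// smoothedGoals computes the goal CU from the mean of up to `window`
// most recent samples, including the current one.
func smoothedGoals(loads []int, window int) []int {
	goals := make([]int, 0, len(loads))
	for i := range loads {
		start := i + 1 - window
		if start < 0 {
			start = 0
		}
		sum := 0
		for _, v := range loads[start : i+1] {
			sum += v
		}
		goals = append(goals, goalCU(sum/(i+1-start)))
	}
	return goals
}

// countChanges counts how often consecutive goals differ.
func countChanges(goals []int) int {
	n := 0
	for i := 1; i < len(goals); i++ {
		if goals[i] != goals[i-1] {
			n++
		}
	}
	return n
}

func main() {
	// A load that bounces around the 4 CU boundary on every sample.
	loads := []int{3400, 4600, 3400, 4600, 3400, 4600}
	fmt.Println("instant: ", instantGoals(loads), countChanges(instantGoals(loads)))
	fmt.Println("smoothed:", smoothedGoals(loads, 4), countChanges(smoothedGoals(loads, 4)))
}
```

The trade-off is latency: a longer window suppresses more noise but also delays the reaction to a genuine, sustained increase in load.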

One possibly annoying piece of this is that we may need to change a substantial portion of the tests for pkg/agent/core. We probably want a way to override the "goal CU" and directly provide that.
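
One way to get that override is to pass the goal CU calculation in as a function, so tests can supply a fixed value instead of crafting metrics. The sketch below is only an outline of the idea: `Metrics`, `GoalCUFunc`, `State`, and `NextAction` are hypothetical names and do not reflect the actual types in pkg/agent/core.

```go
package main

import (
	"fmt"
	"math"
)

// Metrics is a stand-in for whatever the agent collects (hypothetical).
type Metrics struct {
	LoadAverage float64
}

// GoalCUFunc turns metrics into a goal CU. Injecting it lets tests bypass
// the metrics interpretation entirely.
type GoalCUFunc func(Metrics) uint32

// defaultGoalCU is an illustrative production-style implementation:
// one CU per unit of load, rounded up, with a minimum of 1.
func defaultGoalCU(m Metrics) uint32 {
	if m.LoadAverage <= 1 {
		return 1
	}
	return uint32(math.Ceil(m.LoadAverage))
}

// fixedGoalCU returns a GoalCUFunc that ignores metrics, for use in tests.
func fixedGoalCU(cu uint32) GoalCUFunc {
	return func(Metrics) uint32 { return cu }
}

// State is a heavily simplified stand-in for the agent's scaling state.
type State struct {
	currentCU uint32
	goalCU    GoalCUFunc
}

// NextAction reports which way the state would scale for the given metrics.
func (s *State) NextAction(m Metrics) string {
	goal := s.goalCU(m)
	switch {
	case goal > s.currentCU:
		return "upscale"
	case goal < s.currentCU:
		return "downscale"
	default:
		return "none"
	}
}

func main() {
	prod := &State{currentCU: 2, goalCU: defaultGoalCU}
	fmt.Println(prod.NextAction(Metrics{LoadAverage: 3.2}))

	// In a test, the goal is given directly and the metrics are irrelevant.
	test := &State{currentCU: 2, goalCU: fixedGoalCU(1)}
	fmt.Println(test.NextAction(Metrics{}))
}
```

With this shape, tests of the state machine would keep passing when the interpretation of metrics changes, since they never depend on it.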

Tasks

Pre-requisites

Implementation

... add tasks here as they come up

Follow-ups

... add tasks here as they come up

This is kind of a second take on #737, and a pre-req to #729 so that we can freely change how metrics are interpreted without needing to rewrite our unit tests in 'pkg/agent/core/state_test.go'.

sharnoff · 2024-11-08T01:05:30Z

There's an open RFC that will partially address this issue here: https://www.notion.so/neondatabase/131f189e004780b2915ef2fdb95bae6a

In short: the approach should reduce volatility by ~60% from what we have today, but it's only a fractional decrease — probably insufficient for very volatile workloads on much larger computes.

Omrigan · 2024-11-25T16:00:15Z

Testing #1148

Here is ZoneLoadAverage with stable_ratio=0.25 and mixed_ratio=0.25

Baseline:

Workload is i=0; T=10; while [ $i -lt 1800 ] ; do pgbench -c160 -j40 -b select-only -T$T $CONNSTR; s=$(( $RANDOM % 5 )); sleep $s; i=$(( $i + $s + $T )); done

sharnoff added the t/bug and c/autoscaling/autoscaler-agent labels Jan 8, 2024

sharnoff mentioned this issue Jan 13, 2024

agent/core: Dependency-inject ScalingAlgorithm #737

Closed

Omrigan mentioned this issue Oct 10, 2024

[Autogenerated by AI] Improve scaling algorithm stability #1100

Closed

sharnoff mentioned this issue Nov 4, 2024

agent/core: Allow setting goal CU in tests #1129

Open

sharnoff mentioned this issue Nov 11, 2024

daemon,agent: Use custom load1 for CPU scaling #1136

Draft

Omrigan self-assigned this Nov 18, 2024

Omrigan mentioned this issue Nov 21, 2024

agent/goalcu: implement ZoneLoadAverage #1148

Merged

Omrigan closed this as completed in #1148 Nov 27, 2024

Omrigan closed this as completed in cf514ed Nov 27, 2024
