Implement scaling latency metrics through revisions #983

Merged: 62 commits merged into main from oleg/latency-metrics on Jul 22, 2024

Conversation

@Omrigan (Contributor) commented on Jun 20, 2024

The newly introduced Revision is (essentially) an integer value that can be associated with different parts of the system: the Monitor, the Plugin, NeonVM, and the scaling algorithm itself. There are two kinds of association:

  • TargetRevision corresponds to the desired state of a particular part.
  • CurrentRevision corresponds to the state that has already been achieved.

As the system makes progress, TargetRevision is propagated into the CurrentRevision field, and we track how long that propagation took.

When the same revision value is passed through multiple parts, we can measure the end-to-end latency of multi-component operations.

Fixes #594.
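
To make the mechanism concrete, here is a rough, self-contained sketch of the idea. The names and signatures below are illustrative only; the actual types and helpers in this PR differ in detail:

```go
package main

import (
	"fmt"
	"time"
)

// Revision is essentially a monotonically increasing integer.
type Revision struct {
	Value int64
}

// RevisionWithTime remembers when a revision became the target, so that we can
// tell how long it took to become the current one.
type RevisionWithTime struct {
	Revision
	UpdatedAt time.Time
}

// Propagate copies the target revision into the current one, reporting the
// elapsed time to an observer (e.g. a histogram metric), but only when the
// target is newer than what has already been achieved.
func Propagate(now time.Time, target RevisionWithTime, current *Revision, observe func(time.Duration)) {
	if target.Value <= current.Value {
		return // already achieved; nothing to measure
	}
	if observe != nil {
		observe(now.Sub(target.UpdatedAt))
	}
	*current = target.Revision
}

func main() {
	// The scaling algorithm decides on a new desired state and stamps it.
	target := RevisionWithTime{Revision: Revision{Value: 1}, UpdatedAt: time.Now()}
	var current Revision // zero value: nothing achieved yet

	// ... a component (plugin / vm-monitor / NeonVM) does the actual work ...
	time.Sleep(10 * time.Millisecond)

	// The component confirms: propagate target -> current and record latency.
	Propagate(time.Now(), target, &current, func(d time.Duration) {
		fmt.Printf("scaling latency: %v\n", d)
	})
}
```

In the real code, the observe hook plays roughly the role of the ObservabilityCallbacks passed around the agent (e.g. MonitorLatency in the snippet quoted later in this conversation).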

@Omrigan (Contributor, Author) commented on Jun 20, 2024

@sharnoff can you take a look if you have a moment? This is very much WIP; I haven't updated the tests or tested the changes yet.

Looking at the logic around the agent's state: does this clock-propagation logic look sound to you? Can I make it simpler and more compact somehow?

@sharnoff (Member) left a comment

Had a look - I think I'm struggling to get the big picture of what the intended flow within the autoscaler-agent state machine is (mostly because the existing code is quite complicated, and the new stuff doesn't yet have comments).

Broadly I think it should be ok, but do you have a quick (1-2 paragraph) explanation of how the clocks are supposed to flow?

(IIUC, currently the "desired logical time" stored in the plugin/monitor/neonvm state is basically the logical time of the most recent scaling that the component has completed successfully? If so, AFAICT there are still some subtle edge cases, but they shouldn't require major changes to accommodate.)

Would be good to discuss on Monday 😅


All that aside, one thing I noticed: in a lot of places there are variables like desiredClock or desiredLogicalTime -- IMO, this reads with "desired" as an adjective modifying the clock/logical time, which I guess is not what's intended. I wonder if it'd be better to refer to these more like timestamps, e.g. "tsOfDesired" or "desiredAtTime" etc. (or even just "desiredAt"?)

@Omrigan force-pushed the oleg/latency-metrics branch from adf115e to 7331087 on July 4, 2024 10:38
@Omrigan marked this pull request as ready for review on July 8, 2024 13:13
@Omrigan requested a review from sharnoff on July 8, 2024 15:30
@sharnoff (Member) left a comment

Some thoughts. Not the most thorough review -- rough expectations: next round will include more nits, then one more as final thoughts.

Resolved review threads (some outdated) on:
  • pkg/agent/core/state.go
  • neonvm/apis/neonvm/v1/virtualmachine_types.go
  • pkg/agent/core/logiclock/logiclock.go
  • .golangci.yml
  • pkg/agent/core/state_test.go
  • pkg/agent/executor/exec_monitor.go
  • pkg/agent/runner.go

Omrigan added 14 commits July 9, 2024 15:05
Otherwise, the following fails:

~> go list -m all
go: github.com/optiopay/[email protected]: invalid version: unknown revision 000000000000

@Omrigan force-pushed the oleg/latency-metrics branch from fc37d26 to 63605e1 on July 9, 2024 11:19
@Omrigan changed the base branch from main to oleg/devex on July 9, 2024 11:19
Omrigan added 2 commits July 9, 2024 16:09
@Omrigan requested a review from sharnoff on July 9, 2024 13:43
Base automatically changed from oleg/devex to main July 10, 2024 10:33
@sharnoff mentioned this pull request on Jul 19, 2024
@sharnoff (Member) left a comment

basically final review, a few questions left

Three resolved review threads on pkg/agent/core/state.go (one outdated)

Comment on lines +1199 to +1203
revsource.Propagate(now,
targetRevision,
&h.s.Monitor.CurrentRevision,
h.s.Config.ObservabilityCallbacks.MonitorLatency,
)
@sharnoff (Member):

Similar question here as with the scheduler - what happens when downscale is denied?

@Omrigan (Contributor, Author):

The propagation doesn't happen -> we don't measure the latency.

@sharnoff (Member):

Here we still measure the latency for the vm-monitor even though it was denied, right? Is that intentional? (if so: what are the expected semantics for component latency?)

@Omrigan (Contributor, Author):

Yes, this is intentional, because a denial is also a success. I don't expect denied vs. allowed requests to yield different latency distributions.

what are the expected semantics for component latency?

Well, the distribution of latency for successful requests 🙃

What confuses you here? Perhaps I'm missing something.

@sharnoff (Member) commented Jul 22, 2024:

Well, the distribution of latency for successful requests

My current understanding is either:

  • Component latency should only look at individual request latency
  • Component latency should only be related to end-to-end scaling

If it's the first one, then presumably we shouldn't be using revsource for this (we'd just want a simple histogram metric looking at the time difference since we started the request, right?).

If it's the second one, then we should treat denial as failure, because that doesn't get us closer to scaling.

Does that make sense?

@Omrigan (Contributor, Author):

It is "Component latency should only look at individual request latency".

The implementation will remain as-is for now; it can be simplified later.

I should double-check that the metric name clearly expresses these semantics.

@Omrigan (Contributor, Author):

I think "autoscaling_agent_plugin_latency_seconds" fits fine.

For counting retries, "autoscaling_agent_plugin_phase_seconds" or something similar could work.
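
As a rough illustration of what a metric with that name could look like using the Prometheus client library (the label, buckets, and registration details here are assumptions, not necessarily what this PR registers):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// PluginLatency measures the latency of individual scheduler plugin requests.
// The "direction" label is hypothetical; the real metric may be unlabelled.
var PluginLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "autoscaling_agent_plugin_latency_seconds",
		Help: "Latency of individual scheduler plugin requests made by the autoscaler-agent",
	},
	[]string{"direction"},
)

func init() {
	// Register the metric with the default registry so it is exposed on /metrics.
	prometheus.MustRegister(PluginLatency)
}
```

The observability callback handed to the propagation helper would then just be something like `func(d time.Duration) { PluginLatency.WithLabelValues("up").Observe(d.Seconds()) }`.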

Resolved review thread on pkg/agent/runner.go (outdated)
@Omrigan requested a review from sharnoff on July 21, 2024 12:25
Omrigan added 6 commits July 22, 2024 01:13
Resolved review thread on pkg/agent/core/state.go (outdated)
@Omrigan requested a review from sharnoff on July 22, 2024 16:37
sharnoff added a commit that referenced this pull request Jul 22, 2024
Noticed while reviewing a new test in #983 that triggers this warning.
@Omrigan enabled auto-merge (squash) on July 22, 2024 20:47
@Omrigan merged commit 4395a93 into main on Jul 22, 2024; 15 checks passed
@Omrigan deleted the oleg/latency-metrics branch on July 22, 2024 21:11
Merging this pull request may close: Epic: Scaling latency metrics