
Could we automatically detect/report performance regressions? #285

Open
maximecb opened this issue Jul 25, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@maximecb
Contributor

maximecb commented Jul 25, 2024

Recently we've been surprised by multiple major performance regressions, and we're finding that it's very challenging to identify the root cause(s) after the fact. It would be really useful if yjit-metrics could help us automatically flag potential regressions.

I know this is not an easy or trivial thing to implement, because there can often be false positives due to the inherent noise in measurements. I was thinking that since we have error bars on benchmarks, it might be possible to have a criterion such as:

If the average time for the current run of a benchmark is above the previous average for that benchmark, and the error bars for the two runs don't overlap, then flag a potential regression.

We could also have an adjustable threshold such that we require a certain gap between the error bars before we report a regression (multiples of the largest or smallest of the stddevs? e.g. only report a slowdown if the gap between the error bars is greater than 0.5 * min(stddev1, stddev2)). We could tune this criterion to reduce the probability of false positives.
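
To make that concrete, here is a minimal sketch of what the check could look like. The names (`BenchResult`, `potential_regression`), the time-in-ms assumption, and the 0.5 factor are purely illustrative, not anything that exists in yjit-metrics today:

```python
from dataclasses import dataclass

@dataclass
class BenchResult:
    mean: float    # mean iteration time in ms (lower is better)
    stddev: float  # standard deviation across runs

def potential_regression(prev: BenchResult, curr: BenchResult,
                         gap_factor: float = 0.5) -> bool:
    """Flag a regression when the current mean time is higher than the
    previous one and the error bars (mean +/- stddev) are separated by
    more than gap_factor * the smaller of the two stddevs."""
    if curr.mean <= prev.mean:
        return False  # current run is not slower, nothing to flag
    # Gap between the bottom of the current error bar and the top of the previous one.
    gap = (curr.mean - curr.stddev) - (prev.mean + prev.stddev)
    return gap > gap_factor * min(prev.stddev, curr.stddev)

# Example: 12.0ms +/- 0.2 previously, 13.1ms +/- 0.3 now -> flagged
# potential_regression(BenchResult(12.0, 0.2), BenchResult(13.1, 0.3))  # True
```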

This system wouldn't be foolproof, and it may not detect very small regressions, but I think it could still be helpful: if a microbenchmark suddenly slows down by 5-10%, it would automatically get flagged. Currently we rarely look at our microbenchmarks, so these things can go completely undetected... But we could have a microbenchmark for object allocation, for example, and get automatically notified if object allocation performance takes a big drop. @XrXr

If one or more regressions are detected, a message could be posted in the benchmark CI Slack channel, and the bot could do an @here so that people in the channel are notified, or tag specific members of the YJIT team.
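
As a rough sketch of the notification side: Slack's incoming-webhook API accepts a plain JSON payload, and `<!here>` is the markup that renders as an @here mention. The webhook URL and message format below are placeholders, not anything configured in this repo:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def notify_slack(regressions: list[str]) -> None:
    """Post the list of flagged benchmarks to the benchmark CI channel."""
    text = "<!here> Potential performance regressions detected:\n"
    text += "\n".join(f"- {name}" for name in regressions)
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```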

@maximecb maximecb added the enhancement New feature or request label Jul 25, 2024
@maximecb
Contributor Author

In conjunction with this, I want to add an Object.new microbenchmark to yjit-bench: Shopify/yjit-bench#313

@XrXr
Contributor

XrXr commented Jul 25, 2024

Over in Rust, they have a bot that evaluates perf on every PR, for example: rust-lang/rust#125999 (comment). It's a very different environment, but it might still be worth taking some time to sniff around and see if there is anything they do that we could copy :) For example https://github.com/rust-lang/rustc-perf/blob/master/collector/README.md

@rwstauner
Contributor

There is some code in this repo for "tripwires" that I think was meant for something like this but has been disabled... it might be worth digging that up to see what state it was in.

@maximecb
Contributor Author

maximecb commented Jul 25, 2024

Good suggestion @XrXr

There is some code in this repo for "tripwires" that I think was meant for something like this but has been disabled... it might be worth digging that up to see what state it was in.

Yes, I think the reason it was disabled is that it was reporting too many false positives. However, I don't think it was doing anything fancy. If we make use of error bars and better heuristics, we can probably build a better detector. In general, our numbers are quite stable over time.

If anything, even if it only flagged very big and very obvious regressions, it might still be useful, because even a big 5% regression can easily go unseen if we're not religiously paying attention. I mean, a benchmark is showing a 1.76% speedup; was it 1.81% last week? I don't remember.

@maximecb
Contributor Author

As a side note, we could also play special tricks like excluding benchmarks that have very big error bars, such as ruby-lsp. We can tune our heuristics until we get something that works well.
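
One possible way to express that exclusion, reusing the hypothetical `BenchResult` from the sketch above (the 10% cutoff is just an example value to tune):

```python
def too_noisy(result: BenchResult, max_rel_stddev: float = 0.10) -> bool:
    """Skip benchmarks (e.g. ruby-lsp) whose relative stddev is so large
    that the error-bar overlap test would mostly produce noise."""
    return result.stddev > max_rel_stddev * result.mean
```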
