Tweaks to reduce false positive rate of A/B-Tests #4147
Conversation
Codecov Report
All modified lines are covered by tests ✅

@@           Coverage Diff            @@
##             main    #4147   +/-   ##
=======================================
  Coverage   82.99%   82.99%
=======================================
  Files         223      223
  Lines       28448    28448
=======================================
  Hits        23609    23609
  Misses       4839     4839
Force-pushed from c963c96 to f73f3e7.
I still need to review the last commit, but figured I should send these comments out before the PR changes.
Force-pushed from 9137de5 to 4bcd3e8.
The minimum p-value that a permutation test can return is 1/#permutations. Should a p-value threshold smaller than this be chosen, the test can never fail. Therefore, always pick 100 times as many permutations as needed. Signed-off-by: Patrick Roy <[email protected]>
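To illustrate the rule of thumb (a minimal sketch, not the PR's actual code; the helper name is made up):

```python
import math

def required_permutations(p_value_threshold: float, safety_factor: int = 100) -> int:
    # The smallest p-value a permutation test can report is 1/n_permutations,
    # so pick a count that comfortably resolves the configured threshold.
    return math.ceil(safety_factor / p_value_threshold)

print(required_permutations(0.01))  # 10000 permutations for a 1% threshold
```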
Both tests were using identical microvm setups, so we factor it out into a shared fixture (with indirect parameterization for the number of vcpus). Signed-off-by: Patrick Roy <[email protected]>
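A minimal sketch of such a shared fixture, assuming hypothetical fixture names (`uvm_with_vcpus`, `microvm_factory`, etc.) rather than the PR's actual code:

```python
import pytest

@pytest.fixture
def uvm_with_vcpus(request, microvm_factory, guest_kernel, rootfs):
    """Boot a microVM with `request.param` vcpus (illustrative fixture)."""
    vm = microvm_factory.build(guest_kernel, rootfs)
    vm.spawn()
    vm.basic_config(vcpu_count=request.param)
    vm.start()
    return vm

# `indirect=True` routes the parameter into the fixture instead of the test body.
@pytest.mark.parametrize("uvm_with_vcpus", [1, 2], indirect=True)
def test_network_throughput(uvm_with_vcpus):
    ...  # both performance tests now share the identical setup
```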
Some tests are inexplicably unstable: even two consecutive runs on the same machine give wildly different results. Until we find out why they are so unstable, there is no real point in running them, so do not fail the pipeline if these tests fail. Signed-off-by: Patrick Roy <[email protected]>
Instead of using taskset to pin iperf3 workers to specific cores, use iperf3's built-in --affinity argument to do so. Signed-off-by: Patrick Roy <[email protected]>
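For illustration only (variable values are made up; the exact iperf3 invocation in the tests may differ):

```python
host_ip, worker_cpu = "192.168.0.2", 3  # illustrative values

# Before: pin the iperf3 worker through an external taskset wrapper.
old_cmd = f"taskset -c {worker_cpu} iperf3 --client {host_ip} --time 20"

# After: let iperf3 pin its own worker via --affinity.
new_cmd = f"iperf3 --client {host_ip} --time 20 --affinity {worker_cpu}"
```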
These asserts are already done in CpuMap when pinning threads to CPUs, so no need to repeat them here. Signed-off-by: Patrick Roy <[email protected]>
Previously, we concatenated time series for tests that produced data from multiple vcpus. Now, we instead overlay them, to accurately record the net throughput for each second interval, across all workers. This does not make a difference for permutation tests, but does allow us to exploit some properties of averages of time series across different parameterizations. Signed-off-by: Patrick Roy <[email protected]>
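A toy example of the difference (numbers are made up):

```python
# Per-vcpu throughput time series, one sample per second.
vcpu0 = [10.0, 11.0, 9.5]
vcpu1 = [9.0, 10.5, 10.0]

# Old behaviour: concatenate -> 6 samples, each describing a single worker.
concatenated = vcpu0 + vcpu1

# New behaviour: overlay -> 3 samples of net throughput across all workers.
overlaid = [a + b for a, b in zip(vcpu0, vcpu1)]
assert overlaid == [19.0, 21.5, 19.5]
```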
The idea here is the following: intuitively, different parameterizations of the same metric (e.g. the metric "snapshot restore latency" across different microvm configurations) are not completely independent. We would therefore generally expect a real impact to be reflected across multiple tested configurations (and conversely, if different configurations show mean changes in different directions, e.g. 1vcpu gives +5ms latency but 2vcpu gives -5ms latency, that is a further hint that the result is noise). To exploit this, we consider the average of the reported mean regressions, and only report failures for a metric whose average regression is at least 5%.

Mathematically, this has the following justification: by the central limit theorem, the means of samples are normally distributed. Denote by A and B the distributions of the mean of samples from the 'A' and 'B' tests respectively. Under our null hypothesis, the distributions of the 'A' and 'B' samples are identical (although we don't know what the exact distributions are), so A and B are identical as well, say A ~ B ~ N(mu, sigma^2). The difference of two normal distributions is again normally distributed, with the means subtracted and the variances added, so A - B ~ N(0, 2*sigma^2).

However, each parameterization (e.g. 1vcpu, 2vcpu, and so on) will potentially have a different variance sigma^2. Here we make the following assumption: for all parameterizations of the same metric, sigma^2/mu^2 = const. I have no mathematical justification for this (it is a property of our tests), but empirically it works out. This means that (A - B)/mu ~ N(0, c), with c being a constant that is identical across all parameterizations of a metric. We can therefore combine the relative mean differences across different parameterizations and obtain a distribution whose expected value is 0, provided our null hypothesis is true.

This is exactly what the code added in this commit verifies: the empirical average of our samples from this distribution deviates only negligibly from 0. Only if it deviates significantly (here, by more than 0.05) do we actually allow failures in that metric. For all tests but the snapshot restore test on x86_64/{5.10,6.1}, this allows us to completely drop the per-test noise threshold to 0.

Signed-off-by: Patrick Roy <[email protected]>
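A minimal sketch of this criterion (not the PR's actual code; the function and argument names are illustrative):

```python
def metric_may_fail(means_by_config: dict[str, tuple[float, float]],
                    threshold: float = 0.05) -> bool:
    """Return True if the metric's average relative regression exceeds the threshold.

    `means_by_config` maps each parameterization (e.g. "1vcpu") to the pair
    (mean of A-run samples, mean of B-run samples).
    """
    # Relative change of the mean per parameterization: (B - A) / A.
    relative_changes = [(b - a) / a for a, b in means_by_config.values()]
    # Under the null hypothesis these are samples of a zero-mean normal
    # distribution with a shared variance, so their average should be close to 0.
    average_change = sum(relative_changes) / len(relative_changes)
    return abs(average_change) >= threshold

# Mean changes in opposite directions largely cancel out and are treated as noise:
print(metric_may_fail({"1vcpu": (100.0, 105.0), "2vcpu": (100.0, 95.0)}))  # False
```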
They are a remnant of when the performance tests were run without pytest parameterization, meaning all test scenarios were run inside the same pytest test. This required large timeouts. Now, with pytest parameterization, the runtime of each test is at most 30s, which comfortably fits into pytest's default timeout of 6min. Signed-off-by: Patrick Roy <[email protected]>
Since we no longer need to maintain separate sets of baselines for each release branch, it becomes feasible to run performance tests on release branches. These only trigger when PRs are merged. Signed-off-by: Patrick Roy <[email protected]>
It lives in the "CloudWatchMetrics" subdictionary, not in the "_aws" root dictionary. This was preventing metrics from showing up in the correct namespace in the CloudWatch web interface (they were listed under "local" instead of "PerfTests"). Signed-off-by: Patrick Roy <[email protected]>
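For context, a sketch of the CloudWatch Embedded Metric Format layout being referenced (field values are illustrative, not the tests' actual output): the namespace belongs inside the "CloudWatchMetrics" entries rather than next to "Timestamp" at the "_aws" root.

```python
emf_record = {
    "_aws": {
        "Timestamp": 1700000000000,
        "CloudWatchMetrics": [
            {
                "Namespace": "PerfTests",  # correct location for the namespace
                "Dimensions": [["instance", "host_kernel"]],
                "Metrics": [{"Name": "throughput", "Unit": "Megabits/Second"}],
            }
        ],
    },
    "instance": "m6i.metal",
    "host_kernel": "5.10",
    "throughput": 9500,
}
```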
Changes
This PR contains a variety of QoL refactors and small bug fixes for the performance A/B-testing setup. The most important change is in c3e2703, which introduces a mechanism that uses correlations between results for a metric across multiple tests to identify (and ignore) outliers and thus reduce false positives.
(a previous version of this PR contained an increased threshold for snapshot restore tests, due to their high instability, but this instability has since been fixed in #4151)
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist
- User-facing changes are mentioned in CHANGELOG.md.
- New TODOs link to an issue.
- This functionality cannot be added in rust-vmm.