
Tweaks to reduce false positive rate of A/B-Tests #4147

Merged
merged 10 commits into main from ab-stability on Oct 9, 2023

Conversation

@roypat roypat (Contributor) commented Oct 3, 2023

Changes

This PR contains a variety of QoL refactors and small bug fixes for the performance A/B-testing setup. The most important change is in c3e2703, which introduces a mechanism that uses correlations between results for a metric across multiple tests to identify (and ignore) outliers and thus reduce false positives.

(a previous version of this PR contained an increased threshold for snapshot restore tests, due to their high instability, but this instability has since been fixed in #4151)

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@codecov codecov bot commented Oct 3, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison: base (c723745) 82.99% vs. head (217ef23) 82.99%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4147   +/-   ##
=======================================
  Coverage   82.99%   82.99%           
=======================================
  Files         223      223           
  Lines       28448    28448           
=======================================
  Hits        23609    23609           
  Misses       4839     4839           
Flag Coverage Δ
4.14-c7g.metal 78.53% <ø> (+<0.01%) ⬆️
4.14-m5d.metal 80.33% <ø> (+0.01%) ⬆️
4.14-m6a.metal 79.46% <ø> (ø)
4.14-m6g.metal 78.53% <ø> (ø)
4.14-m6i.metal 80.31% <ø> (ø)
5.10-c7g.metal ?
5.10-m5d.metal 83.00% <ø> (ø)
5.10-m6a.metal 82.23% <ø> (ø)
5.10-m6g.metal 81.44% <ø> (ø)
5.10-m6i.metal 82.99% <ø> (ø)
6.1-c7g.metal 81.44% <ø> (ø)
6.1-m5d.metal 83.00% <ø> (ø)
6.1-m6a.metal 82.23% <ø> (ø)
6.1-m6g.metal 81.44% <ø> (ø)
6.1-m6i.metal 82.99% <ø> (ø)



@roypat roypat force-pushed the ab-stability branch 3 times, most recently from c963c96 to f73f3e7 on October 4, 2023 09:54
@roypat roypat added the "Status: Awaiting review" label Oct 4, 2023
@zulinx86 zulinx86 self-requested a review October 5, 2023 09:42
@pb8o pb8o (Contributor) left a comment

I still need to review the last commit, but figured I should send these comments before the PR changes.

@roypat roypat force-pushed the ab-stability branch 2 times, most recently from 9137de5 to 4bcd3e8 on October 6, 2023 12:22
roypat added 6 commits October 9, 2023 11:06
The smallest p-value that a permutation test can return is
1/#permutations. Should a significance threshold smaller than this be
chosen, the test can never fail. Therefore, always pick 100 times as
many permutations as the chosen threshold requires.

Signed-off-by: Patrick Roy <[email protected]>
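A minimal sketch of that sizing rule (the function name and oversampling factor are illustrative, not the actual tools/ab_test.py code):

```python
import math

def permutation_count(p_threshold: float, oversampling: int = 100) -> int:
    """Pick a permutation count whose resolution (1/N) is `oversampling` times finer than the threshold."""
    return math.ceil(oversampling / p_threshold)

print(permutation_count(0.01))  # 10000 permutations, so the smallest reachable p-value is 0.0001
```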
Both tests were using identical microvm setups, so we factor the setup
out into a shared fixture (with indirect parameterization for the number
of vcpus).

Signed-off-by: Patrick Roy <[email protected]>
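A hypothetical sketch of such a fixture with indirect parameterization of the vcpu count; fixture and helper names (e.g. `microvm_factory`) are illustrative, not the actual integration-test code:

```python
import pytest

@pytest.fixture
def shared_microvm(request, microvm_factory):
    """Boot a microvm whose vcpu count is supplied via indirect parameterization."""
    vm = microvm_factory.build()               # hypothetical factory fixture
    vm.spawn()
    vm.basic_config(vcpu_count=request.param)  # vcpu count comes from the test's parameter
    vm.start()
    return vm

# Both tests reuse the same fixture; `indirect=True` routes the parameter to it.
@pytest.mark.parametrize("shared_microvm", [1, 2], indirect=True)
def test_network_throughput(shared_microvm):
    ...
```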
Some tests are inexplicably unstable: even two consecutive runs on
the same machine give wildly different results. Until we find out why
they are this unstable, there is no real point in running them, so do
not fail the pipeline if these tests fail.

Signed-off-by: Patrick Roy <[email protected]>
Instead of using taskset to pin iperf3 workers to specific cores, use
iperf3's built-in --affinity argument to do so.

Signed-off-by: Patrick Roy <[email protected]>
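For illustration only (the address and core number below are made up), the switch amounts to replacing the taskset wrapper with iperf3's own flag:

```python
host = "192.168.0.2"  # hypothetical iperf3 server address
worker_cpu = 3        # hypothetical core to pin the worker to

# Before: pinning the whole process via taskset
old_cmd = f"taskset -c {worker_cpu} iperf3 --client {host}"
# After: letting iperf3 set its own CPU affinity
new_cmd = f"iperf3 --client {host} --affinity {worker_cpu}"
```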
These asserts are already done in CpuMap when pinning threads to CPUs,
so no need to repeat them here.

Signed-off-by: Patrick Roy <[email protected]>
Previously, we concatenated the time series for tests that produce data
from multiple vcpus. Now, we overlay them instead, accurately recording
the net throughput for each one-second interval across all workers. This
does not make a difference for permutation tests, but it does allow us to
exploit some properties of averages of time series across different
parameterizations.

Signed-off-by: Patrick Roy <[email protected]>
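A minimal sketch of the difference, using made-up per-second throughput numbers rather than the real harness data structures:

```python
from collections import defaultdict

worker_a = {0: 10.0, 1: 11.0, 2: 9.5}   # second -> Mbps reported by one worker
worker_b = {0: 9.0, 1: 10.5, 2: 10.0}   # second -> Mbps reported by another worker

# Concatenation (old): the series are appended back to back.
concatenated = list(worker_a.values()) + list(worker_b.values())
# -> [10.0, 11.0, 9.5, 9.0, 10.5, 10.0]

# Overlay (new): values belonging to the same second are summed, giving the
# net throughput across all workers for each interval.
overlaid = defaultdict(float)
for series in (worker_a, worker_b):
    for second, mbps in series.items():
        overlaid[second] += mbps
# -> {0: 19.0, 1: 21.5, 2: 19.5}
```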
pb8o previously approved these changes Oct 9, 2023
roypat added 3 commits October 9, 2023 11:12
The idea here is the following: intuitively, different parameterizations
of the same metric (e.g. the metric "snapshot restore latency" across
different microvm configurations) are not completely independent.
Therefore, we would generally expect a real impact to be reflected across
multiple tested configurations (and particularly, if different
configurations show mean changes in different directions, e.g. 1 vcpu
gives +5ms latency but 2 vcpus give -5ms latency, then this is a
further hint at the result being noise). To counteract this, we consider
the average of the reported mean regressions, and only report failures
for a metric whose average regression is at least 5%.

Mathematically, this has the following justification: by the central
limit theorem, the means of samples are normally distributed. Denote by A
and B the distributions of the mean of samples from the 'A' and 'B'
tests respectively. Under our null hypothesis, the distributions of the
'A' and 'B' samples are identical (although we don't know what the exact
distributions are), meaning so are A and B, say A ~ B ~ N(mu, sigma^2).
The difference of two normal distributions is also normally distributed,
with the means being subtracted and the variances being added.
Therefore, A - B ~ N(0, 2sigma^2). However, each parameterization (e.g.
1 vcpu, 2 vcpus, and so on) will potentially have a different variance
sigma^2. Here, we make the following assumption: for all parameterizations
of the same metric, sigma^2/mu^2 is constant. I have no mathematical
justification for this (as it is a property of our tests), but
empirically it works out. This means that (A-B)/mu ~ N(0, c), with c
being a constant that is identical across all parameterizations of a
metric. We can therefore combine the relative means across different
parameterizations and get a distribution whose expected value is 0,
provided our null hypothesis was true. This is exactly what the code
added in this commit verifies: the empirical average of our samples of
this distribution deviates only negligibly from 0. Only if it deviates
significantly (here, by more than 0.05) do we actually allow failures
in that metric.

For all tests but the snapshot restore test on x86_64/{5.10,6.1}, this
allows us to completely drop the per-test noise threshold to 0.

Signed-off-by: Patrick Roy <[email protected]>
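An illustrative sketch of that acceptance criterion (not the actual code added in tools/ab_test.py; the function name and example numbers are made up):

```python
from statistics import mean

def metric_regressed(relative_mean_changes: list[float], threshold: float = 0.05) -> bool:
    """Each entry is (mean_B - mean_A) / mean_A for one parameterization of the metric."""
    return abs(mean(relative_mean_changes)) > threshold

# Changes in opposite directions largely cancel out and are treated as noise.
print(metric_regressed([+0.05, -0.05]))         # False
# A regression that shows up consistently across parameterizations is reported.
print(metric_regressed([+0.07, +0.06, +0.08]))  # True
```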
They are remnants of when the performance tests were run without pytest
parameterization, meaning all test scenarios ran inside the same pytest
test. This required large timeouts. Now, with pytest parameterization,
the runtime of each test is at most 30s, which comfortably fits into
pytest's default timeout of 6min.

Signed-off-by: Patrick Roy <[email protected]>
Since we no longer need to maintain separate sets of baselines for each
release branch, it becomes feasible to run performance tests on release
branches. These runs are only triggered when PRs are merged.

Signed-off-by: Patrick Roy <[email protected]>
The "Namespace" key lives in the "CloudWatchMetrics" subdictionary, not in
the "_aws" root dictionary. Its previous placement was preventing metrics
from showing up in the correct namespace in the CloudWatch web interface
(they were listed under "local" instead of "PerfTests").

Signed-off-by: Patrick Roy <[email protected]>
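For context, this is the general shape of a CloudWatch Embedded Metric Format record; the metric and dimension names below are hypothetical, but the placement of "Namespace" inside a "CloudWatchMetrics" entry (rather than at the top level of "_aws") is the point of the fix:

```python
emf_record = {
    "_aws": {
        "Timestamp": 1696845600000,
        "CloudWatchMetrics": [
            {
                "Namespace": "PerfTests",  # correct placement of the namespace key
                "Dimensions": [["instance", "kernel"]],
                "Metrics": [{"Name": "restore_latency", "Unit": "Milliseconds"}],
            }
        ],
    },
    "instance": "m6i.metal",
    "kernel": "6.1",
    "restore_latency": 42.0,
}
```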
@roypat roypat merged commit 9232614 into firecracker-microvm:main Oct 9, 2023
5 checks passed