Final A/B-Tests #4139

roypat · 2023-09-28T14:22:51Z

Changes

Adds A/B-compatible versions of the vsock and TCP throughput tests. Also includes the following small refactorings/improvements:

Declare specific tests as particularly unstable. These are the tests that already have significant delta values associated with them in the current testing setup.
Correctly associate p-values with metrics when submitting to Cloudwatch
Correctly propagate CW metadata when reemitting captured logs (namespace and log stream name)
Only run A/B tests if rust/python code was modified.
Merge the two network tests into one file, because they are significantly simpler now

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

codecov · 2023-09-28T14:31:34Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (a376ab1) 83.10% compared to head (85dae0f) 83.10%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4139   +/-   ##
=======================================
  Coverage   83.10%   83.10%           
=======================================
  Files         225      225           
  Lines       28604    28604           
=======================================
  Hits        23771    23771           
  Misses       4833     4833

Flag	Coverage Δ
4.14-c7g.metal	`78.67% <ø> (ø)`
4.14-m5d.metal	`80.47% <ø> (ø)`
4.14-m6a.metal	`79.61% <ø> (ø)`
4.14-m6g.metal	`78.67% <ø> (ø)`
4.14-m6i.metal	`80.45% <ø> (ø)`
5.10-c7g.metal	`81.57% <ø> (ø)`
5.10-m5d.metal	`83.13% <ø> (ø)`
5.10-m6a.metal	`82.37% <ø> (ø)`
5.10-m6g.metal	`81.57% <ø> (ø)`
5.10-m6i.metal	`83.11% <ø> (ø)`
6.1-c7g.metal	`81.57% <ø> (+<0.01%)`	⬆️
6.1-m5d.metal	`83.13% <ø> (ø)`
6.1-m6a.metal	`82.37% <ø> (ø)`
6.1-m6g.metal	`81.57% <ø> (ø)`
6.1-m6i.metal	`83.11% <ø> (ø)`

Flags with carried forward coverage won't be shown.

tests/integration_tests/performance/test_vsock_ab.py

Similar to the snapshot, network latency, and block tests, our vsock throughput test is adapted to work with A/B-testing. Signed-off-by: Patrick Roy <[email protected]>

Converts the existing network tcp throughput test into an A/B-compatible TCP throughput test. This test is added to test_network_ab.py, since the converted test is short enough that we can collect all the network related tests in a single file. Signed-off-by: Patrick Roy <[email protected]>

Without this information, it is impossible to tell apart p-values for different metrics emitted from the same test. Signed-off-by: Patrick Roy <[email protected]>

Otherwise, the reemitted metrics will carry a namespace of "local", which does not make much sense inside of Cloudwatch. Signed-off-by: Patrick Roy <[email protected]>

Before we emitted raw data, we only submitted the average to display on our dashboards. However, ever since we started submitting the raw data, we could have just instructed CloudWatch to compute averages from it, without also submitting the averages ourselves. So switch to doing that. Also rename network_ab's "latency" metric to "ping_latency" for backward compatibility. Signed-off-by: Patrick Roy <[email protected]>

Because it's a waste of compute resources to run them if only markdown files were changed. Signed-off-by: Patrick Roy <[email protected]>

We want to call the one in host_tools.metrics so that the properties are correctly set. Signed-off-by: Patrick Roy <[email protected]>

The raw time series emitted from iperf3 include data points from the warmup period. Since we do our A/B-Test on this entire time series, we should exclude them. Signed-off-by: Patrick Roy <[email protected]>

roypat force-pushed the most-ab branch 4 times, most recently from 138d51c to faa15d2 September 29, 2023 09:07

roypat marked this pull request as ready for review September 29, 2023 09:07

roypat force-pushed the most-ab branch 3 times, most recently from b809012 to 4e6ec9f September 29, 2023 09:28

zulinx86 previously approved these changes Sep 29, 2023


roypat dismissed zulinx86’s stale review via 766a491 September 29, 2023 14:14

roypat force-pushed the most-ab branch 2 times, most recently from 766a491 to af2fdc4 September 29, 2023 14:44

roypat added 7 commits September 29, 2023 15:52

test: Add A/B-compatible vsock throughput test

c94cafb

Similar to the snapshot, network latency, and block tests, our vsock throughput test is adapted to work with A/B-testing. Signed-off-by: Patrick Roy <[email protected]>

test: Add what metric a p-value is for to its dimensions

31f6354

Without this information, it is impossible to tell apart p-values for different metrics emitted from the same test. Signed-off-by: Patrick Roy <[email protected]>
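To illustrate the point of this commit, here is a minimal sketch of how a p-value's dimensions could carry the name of the metric it was computed for. The helper `p_value_dimensions`, the `"metric"` key, and the dimension and metric names are all hypothetical and are not taken from the Firecracker test framework.

```python
from typing import Dict


def p_value_dimensions(test_dimensions: Dict[str, str], metric_name: str) -> Dict[str, str]:
    """Copy the test's dimensions and record which metric the p-value belongs to.

    Hypothetical helper: the "metric" key is an assumption for illustration.
    """
    dims = dict(test_dimensions)  # copy so the caller's dict is not mutated
    dims["metric"] = metric_name
    return dims


# One test emitting two metrics (names are made up for this example).
base = {"performance_test": "test_vsock_throughput", "instance": "m5d.metal"}
tx = p_value_dimensions(base, "throughput_host_to_guest")
rx = p_value_dimensions(base, "throughput_guest_to_host")
```

With the extra dimension, `tx` and `rx` are distinct dimension sets, so two p-values from the same test no longer collapse into one series.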

test: Correctly propagate CW namespace/logstream name when reemitting

d61e543

Otherwise, the reemitted metrics will carry a namespace of "local", which does not make much sense inside of Cloudwatch. Signed-off-by: Patrick Roy <[email protected]>

test: Only run performance A/B tests if rust code was modified

0533cea

Because it's a waste of compute resources to run them if only markdown files were changed. Signed-off-by: Patrick Roy <[email protected]>
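A rough sketch of such a gate, assuming the decision is made from the list of changed file paths and that "code" means files ending in `.rs` or `.py`. The function name `should_run_ab_tests` and the suffix set are hypothetical; the actual pipeline logic in this PR may differ.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Assumption: these suffixes stand in for "rust/python code".
CODE_SUFFIXES = {".rs", ".py"}


def should_run_ab_tests(changed_files: Iterable[str]) -> bool:
    """Return True if at least one changed path looks like Rust or Python code."""
    return any(PurePosixPath(path).suffix in CODE_SUFFIXES for path in changed_files)


docs_only = should_run_ab_tests(["README.md", "docs/getting-started.md"])
with_code = should_run_ab_tests(["CHANGELOG.md", "src/vmm/src/lib.rs"])
```

A suffix check like this ignores build files such as `Cargo.toml`, so a real gate would likely need a broader rule.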

fix: Call correct metrics_logger factory in ab_test.py

80f423d

We want to call the one in host_tools.metrics so that the properties are correctly set. Signed-off-by: Patrick Roy <[email protected]>

roypat force-pushed the most-ab branch from af2fdc4 to 80f423d September 29, 2023 14:52

test: Exclude warmup datapoints from A/B-Test

ba2f971

The raw time series emitted from iperf3 include data points from the warmup period. Since we do our A/B-Test on this entire time series, we should exclude them. Signed-off-by: Patrick Roy <[email protected]>
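The idea can be sketched as follows, assuming the time series is a chronological list with one sample per reporting interval. `drop_warmup` and its parameters are hypothetical names chosen for illustration, not the helper used in the test suite.

```python
from typing import List


def drop_warmup(samples: List[float], warmup_sec: int, interval_sec: int = 1) -> List[float]:
    """Return the samples with the leading warmup intervals removed.

    Assumes one sample per reporting interval, in chronological order.
    """
    if warmup_sec < 0 or interval_sec <= 0:
        raise ValueError("warmup_sec must be >= 0 and interval_sec must be > 0")
    # Number of leading samples that overlap the warmup window (ceiling division).
    skip = -(-warmup_sec // interval_sec)
    return samples[skip:]


# Made-up throughput samples: the first two are still ramping up.
throughput = [1.0, 2.5, 9.8, 10.1, 9.9, 10.0]
steady = drop_warmup(throughput, warmup_sec=2)
```

Only `steady` would then be passed on to the A/B comparison, so ramp-up samples do not distort the comparison between the two revisions.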

roypat added the Status: Awaiting review label Oct 2, 2023

pb8o approved these changes Oct 2, 2023

Merge branch 'main' into most-ab

85dae0f

zulinx86 approved these changes Oct 2, 2023

roypat merged commit cd27794 into firecracker-microvm:main Oct 2, 2023
5 checks passed
