Flaky `console-subscriber` integration tests #473

hds · 2023-10-11T10:30:12Z

What crate(s) in this repo are involved in the problem?

console-subscriber

What is the issue?

The console-subscriber integration tests show flaky behavior locally.

The tests were originally introduced in #452.

How can the bug be reproduced?

Run all the framework tests locally. Sometimes the self_wakes test fails.

To reproduce this, I ran the tests in a loop until they failed. On my machine it typically takes under 5 minutes to produce a failure.

echo "Start!" && time while [ $? -eq 0 ]; do cargo test -p console-subscriber --test framework; done

Logs, error output, etc

No response

Versions

main branch.
console-subscriber v0.2.0

Possible solution

It appears that the root cause for this issue is the one described in tokio-rs/tracing#2743.

Additional context

No response

Would you like to work on fixing this bug?

yes

The text was updated successfully, but these errors were encountered:

The `ConsoleLayer` uses the `Callsites` struct to store and check for callsites for specific kinds of traces, for example spawn spans or waker events. `Callsites` stores a fixed size array of pointers to the `Metadata` for each callsite and a length to indicate how many callsites it has registered. The length and each individual pointer are stored in atomics. Since it is possible for these values to change individually, if a callsite lookup fails, we check if the length of the array has changed while we were checking the pointers, if it has, the lookup is started again. However, there is still a possible race condition. If the length changes, but the lookup occurs before the callsite pointer is actually written, then we may miss a callsite that is in the process of being registered. In this case, the pointer which is loaded from the `Callsites` array will be null. This change adds a check for this case (null ptr), and reperforms the lookup if it occurs. This race condition was found while chasing down the source of #473. It doesn't solve the flakiness, but it can reduce the likelihood of it occuring, thus it is a mitigation only. In reality, neither of these race condition checks should be needed, as we would expect that `tracing` guarantees that `ConsoleLayer` completes `register_callsite()` before `on_event()` or `new_span()` are called.

A flakiness problem has been discovered with the `console-subscriber` integration tests introduced in #452. Issue #473 is tracking the issue. It has been observed that we only "miss" the wake operation event when it comes from `yield_now()`, but not when it comes from a task that yielded due to `sleep`, even when the duration is zero. it is likely that this is due to nature of the underlying race condition. This change removes all the calls to `yield_now()` from the `framework` tests, except those where we wish to actually test self wakes.

A flakiness problem has been discovered with the `console-subscriber` integration tests introduced in #452. Issue #473 is tracking the issue. It has been observed that we only "miss" the wake operation event when it comes from `yield_now()`, but not when it comes from a task that yielded due to `sleep`, even when the duration is zero. it is likely that this is due to nature of the underlying race condition. This change removes all the calls to `yield_now()` from the `framework` tests, except those where we wish to actually test self wakes. Additionally, all the sleeps have been moved out into a separate function which describes why we're using `sleep` instead of `yield_now` when either of them would be sufficient.

Rustin170506 · 2023-11-25T13:06:47Z

Here is another one: double_sleep_wakes
https://github.com/tokio-rs/console/actions/runs/6989256116/job/19017617547?pr=494

hds · 2024-04-10T12:23:34Z

These two tests are so flaky that I'm going to delete them until we fix the problem. From console-subscriber/tests/wake.rs:

#[test]
fn sleep_wakes() {
    let expected_task = ExpectedTask::default()
        .match_default_name()
        .expect_wakes(1)
        .expect_self_wakes(0);

    let future = async {
        sleep(Duration::ZERO).await;
    };

    assert_task(expected_task, future);
}

#[test]
fn double_sleep_wakes() {
    let expected_task = ExpectedTask::default()
        .match_default_name()
        .expect_wakes(2)
        .expect_self_wakes(0);

    let future = async {
        sleep(Duration::ZERO).await;
        sleep(Duration::ZERO).await;
    };

    assert_task(expected_task, future);
}

We have identified that some of the `console-subscriber` integration tests are flaky (#473). This appears to be a result of the following issue in `tracing` tokio-rs/tracing#2743. Flaky tests are really worse than no tests. So these tests will be removed until the flakiness can be fixed.

hds added the S-bug Severity: bug label Oct 11, 2023

hds mentioned this issue Oct 11, 2023

fix(subscriber): mitigate race in Callsites::contains #474

Closed

hds mentioned this issue Oct 11, 2023

test(subscriber): prefer sleep over yield_now in tests #475

Merged

hds added the C-subscriber Crate: console-subscriber. label Oct 11, 2023

hds mentioned this issue Nov 19, 2023

task: make task span explicit root tokio-rs/tokio#6158

Merged

hds mentioned this issue Apr 10, 2024

test(subscriber): remove very flaky tests #544

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky `console-subscriber` integration tests #473

Flaky `console-subscriber` integration tests #473

hds commented Oct 11, 2023

Rustin170506 commented Nov 25, 2023

hds commented Apr 10, 2024

Flaky console-subscriber integration tests #473

Flaky console-subscriber integration tests #473

Comments

hds commented Oct 11, 2023

What crate(s) in this repo are involved in the problem?

What is the issue?

How can the bug be reproduced?

Logs, error output, etc

Versions

Possible solution

Additional context

Would you like to work on fixing this bug?

Rustin170506 commented Nov 25, 2023

hds commented Apr 10, 2024

Flaky `console-subscriber` integration tests #473

Flaky `console-subscriber` integration tests #473