
Channel sending/receiving ends up spinlooping and deadlocking #114851

Open
kyrias opened this issue Aug 15, 2023 · 5 comments
Labels
C-bug Category: This is a bug. S-needs-repro Status: This issue has no reproduction and needs a reproduction to make progress. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@kyrias
Contributor

kyrias commented Aug 15, 2023

We have a system built around an Espressif ESP32-S3 MCU using ESP-IDF/FreeRTOS. Recently, when we updated our Rust toolchain, we started having issues where, under certain conditions, the watchdog timer would constantly trigger and reset the system. We've managed to track it down to the new crossbeam-channel-based channel implementation in std spinlooping in try_send and try_recv.

Many parts of our system communicate using channels, and the highest-priority threads are the ones that read measurements from sensors and then send those measurements to various channels for further processing using try_send. These threads also read from command channels using try_recv.

Our expectation with this approach is that sending to and receiving from these channels should never block waiting for other threads to run; if nothing can be sent or received right now, the methods should immediately return an Err, which we ignore.
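
To make the setup concrete, here is a minimal sketch of the pattern described above. It is not taken from the real firmware; the Measurement/Command types, buffer sizes, and timings are made up for illustration. It uses std's bounded sync_channel, whose try_send is the one that shows up in the backtrace below.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;
use std::time::Duration;

// Placeholder types standing in for the real sensor/command data.
#[derive(Debug)]
struct Measurement(u32);
#[derive(Debug)]
enum Command {
    SetRate(u32),
}

// High-priority thread: must never block on channel operations.
fn sensor_thread(measurements: SyncSender<Measurement>, commands: Receiver<Command>) {
    let mut sample = 0u32;
    loop {
        // Non-blocking read of pending commands; an empty or disconnected
        // channel is simply ignored.
        if let Ok(cmd) = commands.try_recv() {
            println!("got command: {cmd:?}");
        }

        // Non-blocking send of the latest measurement; a full or
        // disconnected channel is simply ignored.
        let _ = measurements.try_send(Measurement(sample));

        sample += 1;
        thread::sleep(Duration::from_millis(10));
    }
}

fn main() {
    let (meas_tx, meas_rx) = sync_channel::<Measurement>(16);
    let (_cmd_tx, cmd_rx) = sync_channel::<Command>(16);

    thread::spawn(move || sensor_thread(meas_tx, cmd_rx));

    // Lower-priority consumer draining measurements for further processing.
    for m in meas_rx.iter().take(5) {
        println!("measurement: {m:?}");
    }
}
```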

Through some judicious println!-debugging I've found that when we call try_send or try_recv we sometimes end up in a situation where start_send/start_recv performs the following spin_light call many thousands of times:

backoff.spin_light();

backoff.spin_light();

This then leads to our idle task never getting to run, so the watchdog timer times out and resets the system. Disabling the watchdog timer doesn't seem to let it ever get unstuck on its own.

I've tried switching to crossbeam-channel directly as well, and while it seems harder to reproduce with that crate, it's still happening.
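
For reference, the same non-blocking pattern expressed directly against the crossbeam-channel crate looks roughly like this (a minimal sketch; the "0.5" version and the u32 payload are assumptions, not from the report):

```rust
// Cargo.toml (version is an assumption): crossbeam-channel = "0.5"
use crossbeam_channel::{bounded, TryRecvError, TrySendError};

fn main() {
    let (tx, rx) = bounded::<u32>(16);

    // Non-blocking send: a full or disconnected channel comes back as Err.
    match tx.try_send(1) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {}
    }

    // Non-blocking receive: an empty or disconnected channel comes back as Err.
    match rx.try_recv() {
        Ok(v) => println!("got {v}"),
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
    }
}
```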

Meta

rustc --version --verbose:

rustc 1.71.0-nightly (4ca000ac8 2023-07-13) (1.71.0.1)
Backtrace

0x420304b2 - <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next
    at ??:??
0x3fcd9bd0 - _btdm_bss_end
    at ??:??
0x4203038e - std::sync::mpmc::array::Channel<T>::start_send
    at ??:??
0x3fcd9c00 - _btdm_bss_end
    at ??:??
0x42029852 - std::sync::mpmc::Sender<T>::try_send
    at ??:??
0x3fcd9c50 - _btdm_bss_end
    at ??:??
0x42038706 - std::sync::mpsc::SyncSender<T>::try_send
    at /home/remmy/.rustup/toolchains/esp/lib/rustlib/src/rust/library/std/src/sync/mpsc/mod.rs:739
0x3fcd9c70 - _btdm_bss_end
    at ??:??
0x4200b3e3 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
    at /home/remmy/.rustup/toolchains/esp/lib/rustlib/src/rust/library/std/src/thread/mod.rs:529
0x3fcd9d50 - _btdm_bss_end
    at ??:??
0x420bd223 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
    at /home/remmy/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/boxed.rs:1985
0x3fcd9de0 - _btdm_bss_end
    at ??:??
0x420c3735 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
    at /home/remmy/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/boxed.rs:1985
0x3fcd9e00 - _btdm_bss_end
    at ??:??
0x420ef854 - pthread_task_func
    at /home/remmy/src/i/elofleet/firmware/elobox/.embuild/espressif/esp-idf/v5.0.3/components/pthread/pthread.c:196
0x3fcd9e20 - _btdm_bss_end
    at ??:??

@kyrias kyrias added the C-bug Category: This is a bug. label Aug 15, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Aug 15, 2023
@the8472
Member

the8472 commented Aug 15, 2023

Probably a duplicate of #112723, though that one is only about try_recv.

@kyrias
Contributor Author

kyrias commented Aug 15, 2023

They look related, but the actual spinning case from their backtrace is different.


Ultimately it's not a big problem for us if try_recv blocks for a short while in certain cases, but it is a big problem if it spinloops, because under an RTOS that means lower-priority threads will never get to run, and so the whole system hangs.

@the8472
Member

the8472 commented Aug 15, 2023

It seems like they're slightly different instances of the general symptom: something is spinning during a priority inversion. With RT scheduling the inversion can last forever. With less strict scheduling it merely takes a while until it resolves.
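
A deliberately simplified model of that inversion (this is not the actual std::sync::mpmc/crossbeam code, just an illustration of the two-phase claim-then-publish protocol the spinning waits on):

```rust
use std::hint;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// One slot of a bounded queue: one side does its half of the operation on
// the slot and only afterwards publishes a new stamp for the other side.
struct Slot {
    stamp: AtomicUsize,
}

fn main() {
    let slot = Arc::new(Slot { stamp: AtomicUsize::new(0) });

    // "Other side" (e.g. a receiver): does its half of the operation, then
    // publishes the stamp. If this thread is preempted in the window before
    // the store and has a lower RTOS priority than the spinner below, it is
    // never scheduled again and the stamp is never published.
    let other = {
        let slot = Arc::clone(&slot);
        thread::spawn(move || {
            // ... claim the slot and move the message ...
            // <-- preemption in this window is the dangerous case
            slot.stamp.store(1, Ordering::Release); // publish
        })
    };

    // Spinner (e.g. try_send on a high-priority thread): sees a claimed but
    // not-yet-published slot and spins until the stamp changes. This is the
    // role backoff.spin_light() plays; on a fair scheduler the other thread
    // runs soon and the loop exits, but under strict priority scheduling
    // nothing ever forces the lower-priority thread to run, so it can spin
    // forever.
    while slot.stamp.load(Ordering::Acquire) != 1 {
        hint::spin_loop();
    }

    other.join().unwrap();
    println!("resolved");
}
```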

@saethlin saethlin added T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Aug 15, 2023
@ibraheemdev
Member

@kyrias can you try running your code with the crossbeam-rs/crossbeam#1105 branch of crossbeam to see if that fixes your issue?

@Enselic
Member

Enselic commented Nov 13, 2024

Triage: Can someone provide code to reproduce this issue please? Can the problem still be reproduced?

@Enselic Enselic added the S-needs-repro Status: This issue has no reproduction and needs a reproduction to make progress. label Nov 13, 2024