Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in esp_modem_dte (IDFGH-10767) #333

Closed
3 tasks done
mryndzionek-txwireless opened this issue Jul 31, 2023 · 8 comments
Closed
3 tasks done

Race condition in esp_modem_dte (IDFGH-10767) #333

mryndzionek-txwireless opened this issue Jul 31, 2023 · 8 comments

Comments

@mryndzionek-txwireless
Copy link

Answers checklist.

  • I have read the documentation for esp-protocols components and the issue is not addressed there.
  • I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

I seem to be hitting this exception:

I think that set_read_cb is being run twice before this check. It seems to happen when the timings are being "relaxed" by increased logging level, or some higher priority task running in the background. I've found this commit, but yeah, the race is still there. Having the GOT_LINE signal, but then also reading the result variable doesn't guarantee atomicity.
Let me know if you need more details.

@github-actions github-actions bot changed the title Race condition in esp_modem_dte Race condition in esp_modem_dte (IDFGH-10767) Jul 31, 2023
@david-cermak
Copy link
Collaborator

Thanks for the report! Could you please check if this PR #312 fixes your issues?
This simply adds a mutex to protect the callback.

The thread safety was kind of handled (in a very fragile way) by cooperation between the DTE class and its terminal (so the attached terminal would signal when it's okay to clean the callback), but it got broken when adding USB terminal in 44bae24. I'm sorry, I posted the fix some time ago but haven't really gotten to merge it and resolve this issue sooner.

@mryndzionek-txwireless
Copy link
Author

Thanks for the report! Could you please check if this PR #312 fixes your issues?

Yes, I don't see the race, but there seems to be one regression: command_result::OK is returned instead of command_result::TIMEOUT in generic_command. void DTE::set_command_callbacks probably needs:

command_cb.result = command_result::TIMEOUT;

@david-cermak
Copy link
Collaborator

Good catch, thanks!

Have set the default state into:

https://github.com/espressif/esp-protocols/blob/22da99a540c135fd2f6610884ce8916714d6b253/components/esp_modem/src/esp_modem_dte.cpp#L120C19-L133

@mryndzionek-txwireless
Copy link
Author

mryndzionek-txwireless commented Aug 17, 2023

CMUX mode seems to be broken on #312. Haven't managed to get to the bottom, but it seems the AT-commands responses are being ignored in CMUX.

@mryndzionek-txwireless
Copy link
Author

Also, my confirmation above might be premature... I still seem to be hitting https://github.com/david-cermak/esp-protocols/blob/fix/minor_modem_refactor/components/esp_modem/src/esp_modem_dte.cpp#L321, but this time when logging level is INFO, or higher.

@david-cermak
Copy link
Collaborator

You're correct, there might be also other issues with the PR. I'm sorry, it's been 2 months since I touched it and I need to spend some time with it.

Clearly this change #333 (comment) did help introducing a data race, since we need to set the TIMEOUT state while the callback is locked. Was able to reproduce the bug just running a host test, and our CI also indicated something wrong: https://github.com/espressif/esp-protocols/actions/runs/5890096726/job/15974984309

@david-cermak
Copy link
Collaborator

I'm sorry it took so long, but I've finally merged #312

You were right, there were multiple issues with that PR, not only the locking:

Now it should be fixed and merged to master.

I'll test it some more, and if everything works I'd create a new release.

@david-cermak
Copy link
Collaborator

This issue has been resolved in cb6e03a, after merging #312; closing now, only after proper testing and releasing of a new version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants