HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) #14648
Comments
Anyone, please?! This is unbelievable. HCI times out every 10-12 hours and NO log is dumped/printed on the debug serial. I'm now on the very latest idf, with the very latest lib. |
If I disable le ad filtering, esp32 can't survive for more than 3hrs. It needs to be RESET every 3 hrs. Please someone say something! NO coredump, NO debug log, NO nothing on serial console. Only a physical HW RESET is the solution for 3hrs. |
I got some updates - although you don't care - shame on you. After constant stress load: LE scanning and occasionally gatt connection to a device. Gatt device wants to increase MTU from 23 to 247 in every connection. We accept this. But after a couple of hours, this is simply NOT working anymore: gatt device not answering. Now big update! After the stress test fails: I can still connect manually to this gatt device (via gatttool)!!! Gatttool refuses to accept the MTU request so it stays 23 and gatt device will also answer to this. I believe gatt device would answer anyway (no matter whether we accept/refuse its MTU change request), but esp is NOT putting it into hci anymore until it gets a reset. Does this make sense to you? So: Is it some buffer handling problem in esp32 lib? Please folks. |
Note: gatt client's reply is not received in either case:
Only in one single case will gatt client's reply reach us: when we don't accept the new mtu at all (AND we don't request MTU increase either) |
Anyone knows anybody who can provide me the source of libbtdm_app.a, or I will have to reverse engineer it? Maybe @BetterJincheng? |
Ping |
pong |
@danergo |
@esp-zhp: sure. Thank you! I have an addition: when the ESP starts this behavior (i.e. doesn't forward packets after MTU negotiation has been done), a 100% fix is to restart the ESP with its reset button (/en switch).
This is (I believe) sending a hci reset command which seems to solve the issue for another 10-12 hours. |
Currently, there isn't enough information for me to pinpoint the issue. |
HCI reset command can resolve my issue. Other than that, I have thousands (if not millions) of lines of "btmon" logs. BlueZ doesn't send any extra command (at least nothing extra is visible in btmon logs). In case hci reset fixes the mtu problem, how would it be ideal? Thank you! |
I have some news: while I was away from this device, ESP started behaving wrong again (same MTU problem). But after constant thorough connection requests (1130 connection retrials for more than 3 hours!!!), it 'fixed' itself: now MTU negotiation doesn't ruin the incoming indication packets, but I believe after 10-12hours it will break again. |
If resetting the HCI resolves the MTU issue, I believe it's acceptable. |
Sorry, I don't think it's acceptable:
Now, this device doesn't move a single millimeter, it's staying in one place (as is the ESP). My point here is that from my application, I can't judge whether a device timeout is due to a real timeout or to ESP misbehavior. Anyway, in dmesg, I see these errors a lot:
This is timeout, ESP doesn't answer for my requests. This happens, when ESP reaches the problematic phase. |
Could you capture the packets to verify if the peer device is indeed sending Indications above 23? If the peer device doesn't send the Indications, then the ESP device wouldn't be able to receive them either. If the issue still persists, I need more information to further diagnose the issue. Could you provide the complete HCI data? |
Dear @esp-zhp: I would happily provide any logs if it can help you with diag. "LE Create connection" always succeeds. The MTU exchange request (from 23 to 247) also always succeeds. Shorter characteristic writes and their confirmations also always succeed. Longer characteristic writes and their confirmations also always succeed. Shorter indications are also always received. Indications longer than 23 are received for 10-12 hours, then they are not forwarded back to hci anymore. I doubt the client has any problems, for 2 quite solid reasons:
HCI data: I have many. All created by btmon -w. Is it okay for you? Can I share this privately? |
One more thing needs to be confirmed to narrow down the issue. Do you think your problem is related to classic Bluetooth? If you are not using classic Bluetooth, does the issue still persist? I'm responsible for BLE and don't have much knowledge about classic Bluetooth. If you believe the issue is related to classic Bluetooth, I can ask my colleagues who handle classic Bluetooth to assist you. |
Could you please send me all the HCI logs from the ESP32? It would be best to use GitHub so that other colleagues can also view them. If it's not convenient for you to share publicly on GitHub, you can also send them to my email ([email protected]). Also, do you have any packet capture devices on your side? It would be even better if you could capture the packets to confirm whether the ESP32 has sent an indication (ATT_HANDLE_VALUE_IND) when the MTU is above 23. |
Sorry for the confusion! Let me clear things up! I'm using ESP for Dual-Mode Bluetooth Controller (controller_hci_uart example from this repo). The issue is related purely to BLE. My RPi is connected via UART to this ESP. ESP is responsible for providing Bluetooth to this RPi (it has no onboard Bt). RPi is attaching the ESP with btattach, therefore RPi sees hci0. BlueZ uses this hci0 interface to provide Bluetooth functionality to my app on RPi. My app is constantly monitoring LE advertisements from a bunch of devices (about 10). No filtering is enabled. Occasionally (4-5 times per 10hrs) my app connects to a remote client with gatt-characteristics, notifications and indications. This occasional "LE Create Connection" always succeeds, but the client is asking an "MTU Exchange Request" after the connection is established (23->247). My app always accepts this new mtu, and responds as intended. Then my app subscribes to indications by writing a specific data to a gatt handle. If we don't do reset, longer indications are not arriving anymore. All the rest details are provided earlier: Does this change anything? Thank you! |
@esp-zhp: mail sent to you, would be appreciated if you could take a look at it. |
Hi, @esp-zhp:
Yes, I do have esp terminal logs yes, will attach here.
It is issuing the "LE Create Connection", but "btmon" recorded the timestamps in UTC, while "dmesg" output has 2 hours later timings.
Yes, I believe this is the correct terminology.
That's correct, there is a 2hour difference:
You will find it at 08:31:15 in the HCI log (packet no: 37761: LE Create Connection).
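To line up the two logs, the btmon timestamps can be shifted by the 2-hour difference mentioned above. This is only a minimal sketch: the function name `btmon_utc_to_local` is made up for illustration, and the fixed UTC+2 offset is an assumption taken from the observed 2-hour gap, not from any configuration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host's wall clock runs at UTC+2, matching the 2-hour gap
# between the btmon (UTC) and dmesg timings described in this comment.
LOCAL_OFFSET = timezone(timedelta(hours=2))


def btmon_utc_to_local(hhmmss: str) -> str:
    """Convert a btmon time-of-day (UTC) into the local wall-clock time."""
    utc = datetime.strptime(hhmmss, "%H:%M:%S").replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL_OFFSET).strftime("%H:%M:%S")


# Packet 37761 (LE Create Connection) is at 08:31:15 in the HCI log.
print(btmon_utc_to_local("08:31:15"))  # 10:31:15
```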
Please check the problematic parts: from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): constant, thorough trying of connection. LE Create Connection succeeds! GATT writes confirmed! Short GATT Indications are received! But not a single sign of the long indication. You can see a working example starting at 14584 (06:59:55,57); please pay attention to the long indication for this connection in 14611 (06:59:55,69): 86 bytes long, response for our "0x0092" GATT write request in 14608 (06:59:55,64). Connection to this client is always done by: Now, with correct sequences at the beginning of the HCI logs, you can investigate this behavior, then you will see the problematic parts from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): step5 is missing. That is more than 2 hours, and more than 30000 HCI packets, spent trying to connect to this Client. During this period, any other device can perfectly connect to the same Client (therefore Client is behaving correctly). Also, during this period, in case I deny the MTU change, ESP will forward the indication in step5, with multiple indications (23-23-23-17). Thank you very much, I appreciate your time spent on this. |
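The 23-23-23-17 pattern is what you get when an 86-byte message is cut into pieces of at most 23 bytes. A small sketch of that arithmetic follows; `split_sizes` is a hypothetical helper, and it makes no claim about which protocol layer the 23-byte figure is measured at.

```python
def split_sizes(total_len: int, max_chunk: int) -> list[int]:
    """Return the sizes of the pieces when total_len bytes are split
    into chunks of at most max_chunk bytes (last chunk may be shorter)."""
    if total_len < 0 or max_chunk <= 0:
        raise ValueError("total_len must be >= 0 and max_chunk must be > 0")
    full, rest = divmod(total_len, max_chunk)
    return [max_chunk] * full + ([rest] if rest else [])


# 86-byte long indication, delivered in pieces when the MTU stays at 23.
print(split_sizes(86, 23))  # [23, 23, 23, 17]
```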
Thank you, compiled, and works. 1 day passed without any issue after retesting began. |
@esp-zhp, it failed again, (with logs disabled). So I believe this is indeed some timing issue. Will share you the logs soon. |
@esp-zhp: logs shared. |
@danergo 2-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds. 3-Don't use my lib, the original lib is OK. Thanks. |
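The retry suggestion in point 2 could look roughly like the sketch below on the application side. Everything here is an assumption for illustration: `send_write` and `disconnect` are hypothetical callbacks supplied by the application (they are not BlueZ or ESP-IDF APIs), and `send_write` is assumed to return True once the write is confirmed.

```python
import time
from typing import Callable


def write_with_retry(
    send_write: Callable[[], bool],   # hypothetical: True if the write was confirmed
    disconnect: Callable[[], None],   # hypothetical: tears down the connection
    wait_s: float = 5.0,
    attempts: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the write; after each failure wait wait_s seconds.
    Disconnect only once all attempts have failed."""
    for _ in range(attempts):
        if send_write():
            return True
        sleep(wait_s)  # wait before the retry, or before the final disconnect
    disconnect()
    return False
```

With the defaults this gives: first attempt, 5 s wait, second attempt, 5 s wait, then disconnect. The `sleep` parameter is injectable only so the timing can be exercised without really waiting.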
My code has no influence or knowledge of L2CAP messages. It just:
Moreover, in case I try manually (with gatttool): this means I'm typing manually, no script involved here: This WORKS for 10-12hrs, but then suddenly long writes are not sent out anymore, UNTIL esp gets reset. And as you can see gatttool has no info on l2cap. Now I'm more than sure gatttool is not changing anything under the hood after 10-12 hrs, moreover I don't have to restart the Linux, only the esp to get back to normal. Moreover in case we enable the verbose log in your library, it hides this problem. Again, this is something we can't control, and now it's more than likely that esp behaves differently after running 10-12 hrs. We can work around this by either running it with verbose log, or restarting it every hour. Neither is too effective. What do you think? |
you may not have understood what I meant. -> I believe this approach might solve your issue. |
Dear @esp-zhp: I understand your points, let me rephrase my concerns here, reacting on yours:
Please see the two screenshots below, taking into consideration my labels and explanations: (For others: each screenshot consists of two logs: the upper one is recorded with a sniffer, and the bottom one is logged directly on the RPi which the ESP32 is attached to) Good communication trial steps:
Key takeaways for good trial:
Okay, let's move on to the stage when ESP starts behaving wrong: Failed communication trial steps:
Key takeaways for failed trial:
Additional notes, regarding your recommendation of
For 2.) please see this new log (btmon_finalproof.cap, shared with your email too): I followed these steps now:
Key takeaway: This is the root cause of the problem: after 10-12 hours, ESP's controller does NOT respect MTU changes anymore, leaving the RX buffer on default 63 bytes therefore ESP is NOT able to process any longer messages until it's reset. I hope you accept this as proof. To me this looks straightforward, but please let me know in case you need any further information/clarification. I really appreciate your help and efforts! Thank you very much |
I meant HCI rx buffer here. |
@esp-zhp: did you have time for checking my last details here? |
@danergo |
Unfortunately I can't resend, as the remote Client is disconnecting after about 1sec in case it doesn't hear from us. You only see this terminate command, because my app doesn't receive any notification about client's disconnect, so we implemented a (max)5 second timeout: after 5seconds, it is 100% impossible that we are still connected so it's safe to drop and close the connection. Client is on battery so it has limited time for communication. What is strange for me is that 63byte threshold: it really seems some default threshold. I can send anything below 64 length, and nothing from 64bytes length. That should be a great way to start finding some spots :) |
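One way to pin down a threshold like this is a binary search over the write length. The sketch below is an assumption-laden illustration: `try_write` is a hypothetical callback that performs one write of the given length and returns True if it got through, and the search assumes the behavior is monotonic (every length up to the threshold works, everything longer fails). The default upper bound of 244 is the largest value that fits in a single write request at an ATT MTU of 247 (MTU minus the 3-byte opcode and handle header).

```python
from typing import Callable


def largest_working_length(
    try_write: Callable[[int], bool],  # hypothetical: True if a write of n bytes succeeds
    lo: int = 1,
    hi: int = 244,                     # ATT MTU 247 minus 3 header bytes
) -> int:
    """Binary-search the largest length in [lo, hi] for which try_write succeeds.
    Returns 0 if even the shortest write fails."""
    if not try_write(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2       # round up so the loop always makes progress
        if try_write(mid):
            lo = mid                   # mid works: threshold is at mid or above
        else:
            hi = mid - 1               # mid fails: threshold is below mid
    return lo


# Simulated controller that drops anything from 64 bytes upward.
print(largest_working_length(lambda n: n < 64))  # 63
```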
Hi, @esp-zhp: do you have any news? |
@esp-zhp: i have a new idea: Creating a firmware in which we can turn on/off logs dynamically via debug uart (i.e.: not hardcode log setting into firmware). In this case we can start with hci logging turned off, then once issue occurs, we can turn on hci logging over debug port, without rebooting esp. What do you think? Can this help? How shall we read from debug uart without disturbing the btdm controller? |
@danergo Based on your description, there is a version with HCI logging enabled. If HCI logging is always on, it works fine, but power consumption will be higher. For now, you can use the library with HCI logging enabled to continue your development. You mentioned: “Creating a firmware that allows us to dynamically turn logging on and off via the debug UART (i.e., without hardcoding log settings into the firmware). In this case, we can start with HCI logging turned off, and then, if an issue occurs, enable HCI logging through the debug port without rebooting the ESP.” I’m not sure if this approach will work, as I haven’t yet identified the root cause of the issue. I will update the library and plan to add debug information for suspected areas. When an issue occurs, you can dump this debug information (I’ll provide a new API for this), and I’ll use the data to help resolve the problem. I’m still working on the debug library, and this may take some time. In the meantime, please proceed with your testing and development. |
Okay, thank you! |
Hi, @esp-zhp: did you have some time to progress? We are running our main service with HCI logs enabled, and it's smooth. But I think this shall not be considered as a final solution. Thank you! |
yes,I will give you lib this week |
Dear @esp-zhp: We are sending you a new "proof" capture: this time with:
This contain a LOT of garbage, please forgive us. Please focus on the ATT frames. You will see normal procedures of GATT writes and confirmations. Focus on 0x0092 Write Response! You'll see: 0x0092 responses are coming until ID#634057 (last Write Response in log). Afterwards, you'll see no more Write Responses (for 0x0092). As this time we have the debug log enabled, we can state, that simply enabling "HCI verbose log" doesn't solve this problem. (Luckily! At least, logging is not hiding this bug :) ) Please check our logs, maybe you'll see something interesting which can help you with this topic. |
From the HCI log on your host side, it appears that BlueZ sent a write request (handle 0x92), but the ESP32 controller either did not send the packet or failed to send it promptly. This seems to be an issue on the ESP32 controller's side. However, from the ESP32 controller's perspective, it did not receive the write request (handle 0x92). I’m unsure how the HCI log on BlueZ was captured or whether it is entirely accurate. |
That's exactly the root cause of the problem! Please note: all ATT 146 messages are long (longer than 63 bytes). I have tested this many times: after ESP (or BlueZ) gets into this problematic stage, any write request (ATT handle doesn't matter) with data length above 63 won't be sent out (anything below or equal to 63 will be sent out perfectly)! Let's try to identify the wrong part here: For the 100% proof, let me wait for a couple of days to let the problematic stage arrive, and then I'll hook up the oscilloscope to the ESP's RX pin, to make you a final proof. :) Thank you! |
Hi, all (@esp-zhp). We have concluded a final test with an O-scope tapped onto the TX pin of RPi (RX pin of ESP32). Our assumption has been 100% validated now: RPi sends the long gatt write, but ESP32 doesn't acknowledge it. Please see the shared log: it's a small one! Facts:
For the last step (3), we have recorded the screen of the scope: this is the proof that ESP's RX pin is receiving this long write (takes almost 700us), but I assume you will not see this in the debug log from ESP (shared with you too): This is now proven to be an ESP-side bug, and it is present in the closed-source library, so we can't do anything else on our side to fix it. |
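As a rough cross-check of a scope measurement like this, the burst duration can be converted into a byte count. This is a sketch only: it assumes 8N1 framing (10 bits on the wire per byte) and back-to-back bytes, and the 921600 baud in the example call is an assumption, since the UART rate is not stated in this thread. Substitute the rate actually configured for the HCI UART.

```python
def uart_bytes_in(duration_us: float, baud: int, bits_per_byte: int = 10) -> float:
    """Number of bytes that fit into duration_us microseconds of continuous
    UART traffic. bits_per_byte=10 assumes 8N1 (start + 8 data + stop)."""
    return duration_us * baud / (bits_per_byte * 1_000_000)


# Assumed baud rate (NOT taken from the thread): 921600.
print(f"{uart_bytes_in(700, 921600):.1f}")  # 64.5
```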
@danergo
|
Thank you for extensive description. I am more than sure that my long write is correct although I have shared a scope screenshot and not a signal analyser (measured shorter gatt writes, and scope showed shorter data flow). However, we do have signal analyser as well, will provide you that evidence too. Will also share our sdkconfig, sleep mode (as far as I remember) is enabled. Will attach it here. Thank you! |
Great! Make sure to print the relevant registers when an issue occurs. Disable sleep mode. |
how do you think A potential solution? |
Uart flow control is enabled. Anyway, for VHCI we have just a slight knowledge. Thank you. Will get back to you with details later today. |
OMG, my sdkconfig is huge, compared to example's. Please find it attached here. Many sleep configs are enabled (which might have been modified by us). Will do now 2 things: 1.) Signal-trace the RX pin for the HCI data We have high hopes about this sleep configuration that it can solve this problem :) Thank you! |
Answers checklist.
IDF version.
Latest
Espressif SoC revision.
NodeMCU-ESP-32S
Operating System used.
Linux
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
None
Development Kit.
NodeMCU-ESP-32S
Power Supply used.
External 5V
What is the expected behavior?
Stable operation
What is the actual behavior?
A manual power cycle is needed every 12hrs.
hciconfig hci0 reset is timing out.
btattach also times out.
Watchdog enabled but it is not triggering a reset.
Coredump enabled but no coredump is being written.
Verbose logging also enabled but only few log items are shown.
Steps to reproduce.
Ble ad scanning with hardware filtering (based on device mac and ad) for at least 8 devices.
Every 5mins, try connecting to a standard (not ble) device (which is out of range), so the connection will always fail.
Occasionally connect to a ble device (which is in range, so this should succeed).
Every 12 hours (roughly) we have to manually reset the esp. Otherwise hci0 will eventually go down.
Before hci0 goes down, we can still try connecting to a ble device but we can't receive longer data from it.
(The ble device asks us for an MTU increase, and we accept it, but then we can't receive data: but this happens ONLY after 10-12 hours of constantly stressing the esp with the above advertising scans and the 5-min connection attempts to the inactive device).
I guess some buffer is overfilling but I couldn't enable any practical logging in menuconfig.
What do you suggest?
Debug Logs.
No response
More Information.
No response