Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) #14648

Open
3 tasks done
danergo opened this issue Sep 30, 2024 · 97 comments
Open
3 tasks done
Assignees
Labels
Status: In Progress Work is in progress Type: Bug bugs in IDF

Comments

@danergo
Copy link

danergo commented Sep 30, 2024

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

Latest

Espressif SoC revision.

NodeMCU-ESP-32S

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

NodeMCU-ESP-32S

Power Supply used.

External 5V

What is the expected behavior?

Stable operation

What is the actual behavior?

Manual power recycle is needed in every 12hrs.

hciconfig hci0 reset is timing out.
btattach also times out.

Watchdog enabled but it is not triggering a reset.
Coredump enabled but no coredump is being written.
Verbose logging also enabled but only few log items are shown.

Steps to reproduce.

Ble ad scanning with hardware filtering (based on device mac and ad) at least 8 devices.

In every 5mins, try connecting to a standard (not ble) devices (which is out of range) - so connection will have to fail always.

Occasionally connect to a ble device (which is in range and shall be succeed).

Every 12 hours (roughly) we have to manually reset the esp. Otherwise hci0 will eventually go down.

Before hci0 going down, we can still try connecting to a ble device but we can't receive longer data from it.

(Ble device asks us an MTU increase, and we accept it, but then we can't receive data: but this happens ONLY after 10-12 hours of constant stressing esp with the above advertising scaninngs and 5mins inactive device connect trials).

I guess some buffer is overfilling but I couldnt enable any practical logging in menuconfig.

What do you suggest?

Debug Logs.

No response

More Information.

No response

@danergo danergo added the Type: Bug bugs in IDF label Sep 30, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Sep 30, 2024
@github-actions github-actions bot changed the title HCI dual-mode controller BTDM with many devices: unstable! HCI dual-mode controller BTDM with many devices: unstable! (IDFGH-13790) Sep 30, 2024
@danergo
Copy link
Author

danergo commented Oct 3, 2024

Anyone please?! This is unbelievable. HCI times out in every single 10-12 hours and NO log is dumped/printed on debug serial.

I'm now on latest latest latest idf, with latest latest lib.

@danergo
Copy link
Author

danergo commented Oct 3, 2024

If I disable le ad filtering, esp32 can't survive for more than 3hrs. it needs to be RESETTED in every single 3 hrs.

Please someone say something! NO coredump, NO debug log, NO nothing on serial console. Only physical HW RESET is the solution for 3hrs.

@danergo
Copy link
Author

danergo commented Oct 4, 2024

I got some updates - although you don't care - shame on you.

After constant stress load: LE scanning and occasionally gatt connection to a device.

Gatt device wants to increase MTU from 23 to 247 in every connection. We accept this.

But after a couple of hours, this is simply NOT working anymore: gatt device not answering.

Now big update! After the stress test fails: I can still connect manually to this gatt device (via gatttool)!!! Gatttool refuses accepting MTU request so it stays 23 and gatt device will also answer to this.

I believe gatt device would answer anyway (no matter weather we accept/refuse its MTU change request), but esp is NOT putting it into hci anymore until it gets a reset.

Does this makes sense to you?

So:
Phase1: test: gatt client changes MTU, we accept, gatt client replies and we receive it
.
.
.
After 10hrs
Phase2: test: gatt client changes MTU, we accept, gatt client MIGHT answer but we NEVER receives it via hci.
Phase2: workaround: don't accept the MTU request and gatt client's reply is arriving to hci.

Is it some buffer handling problem in esp32 lib?

Please folks.

@danergo danergo changed the title HCI dual-mode controller BTDM with many devices: unstable! (IDFGH-13790) HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) Oct 4, 2024
@danergo
Copy link
Author

danergo commented Oct 4, 2024

Note: gate's reply is not received in either case:

  1. When we accept client's mtu
  2. When we initiate mtu change from server (with gatttool mtu 247).

Only in one single case will gatt client's reply reach us: when we don't accept the new mtu at all (AND we don't request MTU increase either)

@danergo
Copy link
Author

danergo commented Oct 6, 2024

Anyone knows anybody who can provide me the source of libbtdm_app.a, or I will have to reverse engineer it? Maybe @BetterJincheng?

@danergo
Copy link
Author

danergo commented Oct 7, 2024

Ping

@danergo
Copy link
Author

danergo commented Oct 8, 2024

pong

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 8, 2024

@danergo
hi,I sincerely apologize for the delay in getting back to you. The recent National Day holiday in China meant that I was out of the office, and I was unable to respond to messages during that time.
I will check your issue,give me some time...

@danergo
Copy link
Author

danergo commented Oct 8, 2024

@esp-zhp: sure. Thank you!

I have some addition: when ESP starts this behavior (i.e doesn't forwards packets after mtu negotiations has been done), a 100% fix is to restart the ESP with its reset button (/en switch).
However (update!), sometimes it's enough to do:

systemctl stop bluetooth.service
systemctl stop my-btattach.service (this is running btattach to get the hci0 interface)
systemctl start bluetooth.service
systemctl start my-btattach.service

This is (I believe) sending a hci reset command which seems to solve the issue for another 10-12 hours.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 8, 2024

Currently, there isn't enough information for me to pinpoint the issue.
I have a few points of suspicion:
ESP32 only support BLE4.2 HCI command,but BlueZ might send latest HCI command, causing a timeout. You can refer to #12650 for more information.
Additionally, if the 'HCI reset command' doesn't resolve the issue, more debug information will be needed, such as providing all HCI commands and event data.If the 'HCI reset command' can resolve your problem, that would be ideal.

@danergo
Copy link
Author

danergo commented Oct 8, 2024

HCI reset command can resolve my issue.

Other than that, I have thousands (if not millions) of lines of "btmon" logs.

BlueZ doesn't send any extra command (at least nothing extra is visible in button logs).

In case hci reset fixes the mtu problem, how would it be ideal?

Thank you!

@danergo
Copy link
Author

danergo commented Oct 8, 2024

I have some news: while I was away from this device, ESP started behaving wrong again (same MTU problem).

But after constant thorough connection requests (1130 connection retrials for more than 3 hours!!!), it 'fixed' itself: now MTU negotiation doesn't ruin the incoming indication packets, but I believe after 10-12hours it will break again.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

If resetting the HCI resolves the MTU issue, I believe it's acceptable.
In fact, MTU doesn't have an HCI command; it's only transmitted via ACL.

@danergo
Copy link
Author

danergo commented Oct 9, 2024

Sorry, I don't think it's acceptable:

  • MTU 23->247, every party accepts it.
  • Subscribe to indications -> we got acknowledge
  • GATT write -> we got confirmation
  • GATT read -> we can read
  • Indications below 24 -> we receive it
  • Indications above 23 -> we don't receive it -> this can be due to:
    A.) ESP fails to handle, or
    B.) Device moved out from range during this exact moment

Now, this device doesn't move a single millimeter, it's staying in one place (so as the ESP).

My point here is, that from my application, I can't judge weather device timeout is due to real timeout, or ESP misbehavior.
Due to this, I can't accept that HCI reset solution is fair, sorry.

Anyway, in dmesg, I see these errors a lot:

[54034.156769] Bluetooth: hci0: Opcode 0x200d failed: -110
[54034.156891] Bluetooth: hci0: request failed to create LE connection: err -110

This is timeout, ESP doesn't answer for my requests. This happens, when ESP reaches the problematic phase.
Problematic phase: when it accept higher MTU, but won't forward longer messages to HCI.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

according to log “[54034.156769] Bluetooth: hci0: Opcode 0x200d failed: -110
[54034.156891] Bluetooth: hci0: request failed to create LE connection: err -110”
The log you provided is too limited, so I couldn’t extract much useful information. However, I’d like to remind you that if the 'LE Create Connection' command is issued, it will keep attempting to establish a connection until it's successful, as this action has a high priority. If the connection fails, you will need to send the 'LE Create Connection Cancel' command to stop the connection attempts.
ps:
BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E page 2366
image

image

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

Could you capture the packets to verify if the peer device is indeed sending Indications above 23? If the peer device doesn't send the Indications, then the ESP device wouldn't be able to receive them either.

If the issue still persists, I need more information to further diagnose the issue. Could you provide the complete HCI data?

@danergo
Copy link
Author

danergo commented Oct 9, 2024

Dear @esp-zhp:

I would happily provide any logs if it can help you with diag.

"LE Create connection" always succeeds. Also MTU exchange request (from 23 to 247) also always succeeds. Shorter characteristic writes, and their confirmations also always succeeds. Longer characteristics writes and their confirmations also always succeeds. Shorter indications also always received. Longer than 23 indications received for 10-12 hours, then they are not forwarded back to hci anymore.

I doubt the client has any problems because 2 quite hard reasons:

  1. After ESP restart, everything starts working again (Longer-than-23 indications are arriving), without restarting the client.
  2. When ESP is not forwarding the Longer indications, I tried with another bluetooth controller: it can perfectly receive the Longer indications from the client. Client hasn't changed in any way.

HCI data: I have many. All created by button -w. Is it okay for you? Can I share this privately?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

One more thing needs to be confirmed to narrow down the issue. Do you think your problem is related to classic Bluetooth? If you are not using classic Bluetooth, does the issue still persist? I'm responsible for BLE and don't have much knowledge about classic Bluetooth. If you believe the issue is related to classic Bluetooth, I can ask my colleagues who handle classic Bluetooth to assist you.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

Could you please send me all the HCI logs from the ESP32? It would be best to use GitHub so that other colleagues can also view them. If it's not convenient for you to share publicly on GitHub, you can also send them to my email ([email protected]).

and Do you have any packet capture devices on your side?It would be even better if you could capture the packets to confirm whether the ESP32 has sent an indication(ATT_HANDLE_VALUE_IND)when mtu above 23.
ps ATT_HANDLE_VALUE_IND:
image

@danergo
Copy link
Author

danergo commented Oct 9, 2024

@esp-zhp:

Sorry for the confusion! Let me clear things up!

I'm using ESP for Dual-Mode Bluetooth Controller (controller_hci_uart example from this repo).

The issue is related purely to BLE.

My RPi is connected via UART to this ESP. ESP is responsible for providing Bluetooth to this RPi (it has no onboard Bt).

RPi is attaching the ESP with btattach, therefore RPi sees hci0. BlueZ uses this hci0 interface to provide Bluetooth functionality to my app on RPi.

My app is constantly monitoring LE advertisements from a bunch of devices (about 10). No filtering is enabled. Occasionally (4-5 times per 10hrs) my app connects to a remote client with gatt-charactetistics, notifications and indications.

This occasional "LE Create Connection" always succeeds, but the client is asking an "MTU Exchange Request" after the connection is established (23->247). My app always accepts this new mtu, and responds as intended. Then my app subscribes to indications by writing a specific data to a gatt handle.
This write is confirmed by the controller, and my app gets reply from the client.
Then my app writes a long data to the client with gatt char-write-req, and this is also always gets acknowledged from ESP side over hci. Then for this large data, Client responds with a long indication (longer than 23). This long indication is arriving to my application for about 10-12 hours, then a reset to ESP is needed (sometimes, HCI reset is enough, but not always: it usually times out in this phase).

If we don't do reset, longer indications are not arriving anymore. All the rest details are provided earlier:
In case my app doesn't accept the new MTU after LE Create Connection, even the longer indication arrives (but in multiple parts: 23-23-23-10 lengths).

Does this change anything?

Thank you!

@danergo
Copy link
Author

danergo commented Oct 9, 2024

@esp-zhp: mail sent to you, would be appreciated if you could take a look at it.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 10, 2024

From the log you provided, the ESP32 is acting as a master role in the GAP layer and as a client role in the GATT layer. right?

In your dmesg log, it shows a connection failure at "Wed Oct 9 10:31:15 2024": "[Wed Oct 9 10:31:15 2024] Bluetooth: hci0: Opcode 0x200d failed: -110 [Wed Oct 9 10:31:15 2024] Bluetooth: hci0: request failed to create LE connection: err -110"

I tried to find the failed connection HCI command and event in the HCI log you provided, but I couldn’t locate the connection establishment command at "Wed Oct 9 10:31:15 2024."
image

There are two possibilities:

1-The timestamps in the dmesg log and HCI log do not match.
Please check why they aren’t aligned.
2-The BlueZ host didn’t send the create connection HCI command.
If the BlueZ host didn’t issue the create connection command, there might be a bug in BlueZ, and you should investigate it further from the BlueZ side.

Additionally, I didn’t find any connection failure information in the HCI log.

The last connection attempt was at 09:07:25, and it was successful. After that, the HCI log only shows continuous scanning without any further connection attempts.
ps:
09:07:25 create connect
image

scan
image

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 10, 2024

First, please check why the BlueZ host didn’t issue the create connect HCI command.

Do you have the ESP32 terminal logs on your side? The ESP32 terminal logs might contain some useful information.

Here’s an example of the ESP32 terminal log(I want the full log):
image

@danergo
Copy link
Author

danergo commented Oct 10, 2024

Hi, @esp-zhp:

Do you have the ESP32 terminal logs on your side? The ESP32 terminal logs might contain some useful information.

Yes, I do have esp terminal logs yes, will attach here.

First, please check why the BlueZ host didn’t issue the create connect HCI command.

It is issuing the "LE Create Connection", but "btmon" recorded the timestamps in UTC, while "dmesg" output has 2 hours later timings.

From the log you provided, the ESP32 is acting as a master role in the GAP layer and as a client role in the GATT layer. right?

Yes, I believe this is the correct terminology.

The timestamps in the dmesg log and HCI log do not match.

That's correct, there is a 2hour difference:

I tried to find the failed connection HCI command and event in the HCI log you provided, but I couldn’t locate the connection establishment command at "Wed Oct 9 10:31:15 2024."

You will find it at 08:31:15 in the HCI log (packet no: 37761: LE Create Connection).

Additionally, I didn’t find any connection failure information in the HCI log.
The last connection attempt was at 09:07:25, and it was successful. After that, the HCI log only shows continuous scanning without any further connection attempts.

Please check the problematic parts: from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): constant, thorough trying of connection. LE Create Connection succeeds! GATT writes confirmed! Short GATT Indications are received! But not a single sing of the long indication.

You can see a working example starting in 14584 (06:59:55,57) please pay attention to long indication for this connection in 14611 (06:59:55,69): 86 bytes long, response for our "0x0092" GATT write request in 14608 (06:59:55,64).

Connection to this client is always done by:
1.) We send LE Create Connection
2.) We Accept (by default)/Decline the Client's new MTU of 247
3.) We send GATT Write Request (Handle: 0x0093) -> Write Response received from Client (short length)
4.) We send GATT Write Request (Handle: 0x0092) -> Write Response received from Client (short length)
5.) Client send GATT Handle Value Indication (Handle: 0x0092), long length -> We send Handle Value Confirmation
6.) Disconnect

Now, with correct sequences at the beginning of the HCI logs, you can investigate this behavior, then you will see the problematic parts from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): step5 is missing. There are more than 2 hours, and more than 30000 HCI Packets trying to connect to this Client. During this period, any other device can perfectly connect to the same Client (therefore Client is behaving correctly). Also, during this period, in case I deny the MTU change, ESP will forward the indication in step5, with multiple indications (23-23-23-17).

Thank you very much, I appreciate your time spent on this.

esp.log

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 10, 2024

1-create connection
In the HCI log at line 37761, I noticed that the connection was not successfully established. I also observed that the BlueZ host canceled the connection after 4 seconds.

I believe it’s normal for a connection to occasionally fail within 4 seconds, as there could be interference over the air.
image

I reviewed the subsequent logs and saw that the connection was successfully established later. Therefore, I didn’t find any issues.
image
image

2-GATT writes confirmed! Short GATT Indications are received! But not a single sing of the long indication.
At line 14593, everything works well, but at line 15312, there is no indication.

I don’t believe this is an issue with the ESP32.
For the ESP32 controller, it does not differentiate between the types and contents of ATT packets; all packets are reported to the host for processing. The ESP32 controller did not receive the indication, because other ATT packets are still being reported to the host normally.
I think the peer device didn’t send the indication at all, so the ESP32 controller couldn’t receive it. BlueZ wouldn't receive the indication either.

line 14593(work well)
image

line 15312(no indicate)
image

You should check why the peer device didn’t send the indication.
you can debug why the peer device did not send the indication based on the following points:

1-Why does the peer device consistently send MTU requests?
2-Are the contents of the write operations we perform on the peer device all correct?
I suspect this is an issue with the application layer logic, not a bug in the Bluetooth protocol stack.
I noticed that the write content sent by the ESP32 is not entirely consistent. For example:
image
image

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 27, 2024

I test it,it works.
extern void esp_log_enable_hcievt_log(bool enable);
extern void esp_log_enable_hcicmd_log(bool enable);
esp_log_enable_hcievt_log(0);
esp_log_enable_hcicmd_log(0);
log:
image

@danergo
Copy link
Author

danergo commented Oct 27, 2024

Thank you, compiled, and works.

1 day passed without any issue after retesting began.

@danergo
Copy link
Author

danergo commented Oct 29, 2024

@esp-zhp, it failed again, (with logs disabled). So I believe this is indeed some timing issue.

Will share you the logs soon.

@danergo
Copy link
Author

danergo commented Oct 29, 2024

@esp-zhp: logs shared.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 30, 2024

@danergo
I think this might be related to the connection parameter update.
Could you run a test where,
1-after establishing a connection, update the connection parameters. wait the update is completed,and then perform a write operation.

2-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds.

3-don‘t use my lib, origin lib is ok
image

thanks.

@danergo
Copy link
Author

danergo commented Oct 30, 2024

@esp-zhp:

My code has no influence or knowledge of L2CAP messages.

It just:

  1. Connects, then AFTER it succeeded:
  2. Sends out a characteristic-write-req to 0x0093 with 0200 data, then AFTER confirmation has been received:
  3. Sends out a long characteristic-write-req.

Moreover, in case I try manually (with gatttool): this means im typing manually, no script involved here:
1.) Connect, then wait until connected
2.) Mtu 247 (gatttool denies client's mtu increase request, but this command sets it correctly), wait for confirm
3.) Char-write-req 0x0093 0200, wait for confirm
4.) Char-write-req long

This WORKS for 10-12hrs, but then suddenly long writes are not sent out anymore, UNTIL esp gets reset. And as you can see gatttool has no info on l2cap.

Now I'm more than sure gatttool is not changing anything under the hood after 10-12 hrs, moreover I don't have to restart the Linux, only the esp to get back to normal.

Moreover in case we enable the verbose log in your library, it hides this problem.

Again, this is something we can't control, and now it's more than likely that esp behaves different after running 10-12 hrs.

We can solve this by either running it wit verbose log, or restart it every hour. Neither is too effective.

What do you think?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 30, 2024

you may not have understood what I meant.
1-I**'ve noticed the issue:** your write request wasn't sent as expected.
2-I suspect this is related to the connection parameter update. so I'd like you to wait until the connection parameter update completes before performing the write operation, rather than writing immediately.-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds.

-> I believe this approach might solve your issue.

@danergo
Copy link
Author

danergo commented Oct 30, 2024

Dear @esp-zhp:

I understand your points, let me rephrase my concerns here, reacting on yours:

1-I**'ve noticed the issue:** your write request wasn't sent as expected.

Please see these two below screenshots, taking into consideration my labels and explanations:

(For others: each screenshot consist of two logs: upper is recorded with a sniffer, and bottom is logged directly on the RPi which ESP32 is attached to)

Good commincation trial steps:

# What? Source Destination SniffID DirectID
1 Exchange MTU Request from Slave to Master 2442 5228
2 Exchange MTU Response from Master to Slave 2446 5230
3 Write Request (0x0093) from Master to Slave 2448 5231
4 Connection Parameter update request from Slave to Master 2449 5234
5 Write Response (for 0x0093) from Slave to Master 2452 5241(!!!) Look at this!
6 Connection Parameter update response from Master! to Slave 2453 5237(!!!)
7 Write Request (0x0092) from Master to Slave 2457 5242

Key takeaways for good trial:

  1. slave sends Connection Parameter update request and does NOT wait for master to reply, but it also sends Write Response. We can not control this behavior, as client is an active product on the market.
  2. Master first sends out the Connection Parameter update response, then it sends out the Write Request
  3. Communication succeeds, 0x0092 Write request is being sent out (please see 7 in above table and in both sniffer and local log below!)

goodtrial

Okay, let's move on to the stage when ESP starts behaving wrong:

Failed commincation trial steps:

# What? Source Destination SniffID DirectID
1 Exchange MTU Request from Slave to Master 106498 202392
2 Exchange MTU Response from Master to Slave 106503 202394
3 Write Request (0x0093) from Master to Slave 106505 202395
4 Connection Parameter update request from Slave to Master 106506 202399
5 Write Response (for 0x0093) from Slave to Master 106508 202406(!!!) Look at this!
6 Connection Parameter update response from Master! to Slave 106509 202402(!!!)
7 Write Request (0x0092) from Master to Slave !!MISSING!! 202407

Key takeaways for failed trial:

  1. Sequence is exactly the same in both good and failed trials! Only difference is that step7 is missing in snifflog, meaning ESP failed to transfer it out. Please see it in log below: local log shows, but sniffle log doesn't.
  2. Please also, refer back on the good trial: we don't wait there for "LE Connection Update Complete" message before sending out the long GATT Write Request, meaning ESP Controller handles the same sequences differently in these 2 scenarios.

failedtrial

Additional notes, regarding your recommendation of

If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting.

  1. It's not the application layer, but the Client which is automatically disconnecting after a short period in case not hearing from us.
  2. No matter how many times we try, long writes are NOT being sent out at all until we reboot ESP.

For 2.) please see this new log (btmon_finalproof.cap, shared with your email too):

I followed these steps now:

  1. Connect to Client
  2. Set MTU (as gatttool by default denies the request from Client)
  3. Send sample data with length of 63 bytes 5 times: ALL GOOD
  4. Send sample data with length of 64 bytes controller doesn't ACK it (see ID#53)!

image

Key takeaway: This is the root cause of the problem: after 10-12 hours, ESP's controller does NOT respect MTU changes anymore, leaving the RX buffer on default 63 bytes therefore ESP is NOT able to process any longer messages until it's reset.

I hope you accept this as proof. To me this looks straightforward, but please let me know in case you need any further information/clarification. I really appreciate your help and efforts!

Thank you very much

@danergo
Copy link
Author

danergo commented Oct 31, 2024

leaving the RX buffer on default 63 bytes

I meant HCI rx buffer here.

@danergo
Copy link
Author

danergo commented Nov 1, 2024

@esp-zhp: did you have time for checking my last details here?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 1, 2024

@danergo
sorry for later reply,I'm still analyzing the cause. After adding the HCI log, the issue stopped occurring, which complicates debugging. I'll add more debug logs to investigate further.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 1, 2024

sniffer:
It is evident that the write request was not sent on time. However, it was still sent.
image
image

If there is no response within 3 seconds, could you resend the write request?

@danergo
Copy link
Author

danergo commented Nov 1, 2024

Unfortunately I can't resend, as the remote Client is disconnecting after about 1sec in case it doesn't hear from us. You only see this terminate command, because my app doesn't receive any notification about client's disconnect, so we implemented a (max)5 second timeout: after 5seconds, it is 100% impossible that we are still connected so it's safe to drop and close the connection.

Client is on battery so it has limited time for communication.

What is strange for me is that 63byte threshold: it really seems some default threshold. I can send anything below 64 length, and nothing from 64bytes length. That should be a great way to start finding some spots :)

@danergo
Copy link
Author

danergo commented Nov 6, 2024

Hi, @esp-zhp: do you have any news?

@danergo
Copy link
Author

danergo commented Nov 7, 2024

@esp-zhp: i have a new idea:

Creating a firmware in which we can turn on/off logs dynamically via debug uart (i.e.: not hardcode log setting into firmware). In this case we can start with hci logging turned off, then once issue occurs, we can turn on hci logging over debug port, without rebooting esp.

What do you think? Can this help? How shall we read from debug uart without disturbing the btdm controller?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 8, 2024

@danergo
Sorry for the late reply again.

Based on your description, there is a version with HCI logging enabled. If HCI logging is always on, it works fine, but power consumption will be higher. For now, you can use the library with HCI logging enabled to continue your development.

You mentioned: “Creating a firmware that allows us to dynamically turn logging on and off via the debug UART (i.e., without hardcoding log settings into the firmware). In this case, we can start with HCI logging turned off, and then, if an issue occurs, enable HCI logging through the debug port without rebooting the ESP.”

I’m not sure if this approach will work, as I haven’t yet identified the root cause of the issue. I will update the library and plan to add debug information for suspected areas. When an issue occurs, you can dump this debug information (I’ll provide a new API for this), and I’ll use the data to help resolve the problem.

I’m still working on the debug library, and this may take some time. In the meantime, please proceed with your testing and development.

@danergo
Copy link
Author

danergo commented Nov 8, 2024

Okay, thank you!

@danergo
Copy link
Author

danergo commented Nov 12, 2024

Hi, @esp-zhp: did you have some time to progress?

We are running our main service with HCI logs enabled, and it's smooth. But I think this shall not be considered as a final solution.

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 13, 2024

yes,I will give you lib this week

@danergo
Copy link
Author

danergo commented Nov 13, 2024

Dear @esp-zhp:

We are sending you a new "proof" capture: this time with:

  • local btmon log (on RPi which ESP is attached to)
  • minicom debug log (ESP's debug log), with full HCI log enabled!

This contain a LOT of garbage, please forgive us. Please focus on the ATT frames. You will see normal procedures of GATT writes and confirmations. Focus on 0x0092 Write Response!

You'll see: 0x0092 responses are coming until ID#634057 (last Write Response in log). Afterwards, you'll see no more Write Responses (for 0x0092).

As this time we have the debug log enabled, we can state, that simply enabling "HCI verbose log" doesn't solve this problem. (Luckily! At least, logging is not hiding this bug :) )

Please check our logs, maybe you'll see something interesting which can help you with this topic.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 15, 2024

Here’s the English translation and refinement:


I analyzed the HCI log file “minicom_newproof.cap” that you provided. Using internal tools, I converted the HCI log into a BTsnoop file format, which is compatible with software like Wireshark and "Wireless Protocol Suite."

At timestamp 654,633, there is an ATT_WRITE_REQ with handle 147:

ATT_WRITE_REQ  147  Not Mapped  0(C)  Unknown [3]  14  00:00:00.000063  2024/11/14 11:29:20.135370  147

I noticed that the controller did not subsequently receive any HCI data related to Attribute Handle: 146. In other words, the controller did not receive the write request (handle 0x92) issued by the Bluz host. This behavior appears abnormal.
I don’t fully understand this behavior. Why didn’t the host continue to send the write request (handle 0x92)? Or did it send the request, but the controller simply didn’t receive it?
image

ps:HCI log from ESP32(decoded)
I will send you by email to protect your privacy.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 15, 2024

From the HCI log on your host side, it appears that Bluz sent a write request (handle 0x92), but the ESP32 controller either did not send the packet or failed to send it promptly. This seems to be an issue on the ESP32 controller's side.

However, from the ESP32 controller's perspective, it did not receive the write request (handle 0x92).

I’m unsure how the HCI log on Bluz was captured or whether it is entirely accurate.
If this issue occurs, could restarting Bluz while keeping the ESP32 controller running resolve the problem?

@danergo
Copy link
Author

danergo commented Nov 17, 2024

@esp-zhp:

I noticed that the controller did not subsequently receive any HCI data related to Attribute Handle: 146. In other words, the controller did not receive the write request (handle 0x92) issued by the Bluz host. This behavior appears abnormal.
I don’t fully understand this behavior. Why didn’t the host continue to send the write request (handle 0x92)? Or did it send the request, but the controller simply didn’t receive it?

That's exactly the root cause of the problem! Please note: all ATT 146 messages are long (longer than 63 bytes).

I have tested this many times: after ESP (or BlueZ) gets into this problematic stage, any write request (ATT handle doesn't matter) with data length above 63 won't send out (anything below or equal 63 will be sent out perfectly)!

Let's try to identify the wrong part here:
1.) BlueZ doesn't forwards the 146: can be, but highly unlikely, as btmon (part of BlueZ) still logs the long gatt writes even during the problematic stage.
2.) ESP doesn't receive the 146: can be, as there is no 146 in the decoded sniff.

For the 100% proof, let me wait for a couple of days to let the problematic stage arrive, and then I'll hook up the oscillator to the ESP's RX pin, to make you a final proof. :)

Thank you!

@danergo
Copy link
Author

danergo commented Nov 20, 2024

Hi, all (@esp-zhp).

We have concluded a final test with an O-scope tapped onto the TX pin of RPi (RX pin of ESP32).

Our assumption has been 100% validated now: RPi sends the long gatt write, but ESP32 doesn't acknowledge it.

Please see the shared log: it's a small one!

Facts:

  1. First we send a few short gatt writes (0x00001111): works OK (Write Request + Write Response)
  2. Secondly, we send a few medium length gatt writes (0x0000111222333444555): works OK (Write Request + Write Response)
  3. Lastly, we send long gatt writes (0x0000000011111111...eeeeeeeeffffffff): FAIL at ESP side!: it is not acknowledging the write.

For the last step (3), we have recorded the screen of the scope: this is the proof, that ESP's RX pin is receiving this long write (takes almost 700us), but I assume you will not see this in the debug log from ESP (shared with you too):

image

This is proved now an ESP-side bug, and it is present in the closed-source library, so we can't do anything else on our side to fix it.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 22, 2024

@danergo
Thank you for your feedback.

  1. You've demonstrated with an oscilloscope that the host is sending data to the ESP32 controller, which is great. However, there's still a question: the oscilloscope lacks the functionality of a logic analyzer. This means we cannot be 100% certain that the transmitted data is correct unless we use a logic analyzer to parse the UART data from the ESP32's RX pin.
    That said, we currently have strong reason to believe that the ESP32 UART is not correctly receiving the data. While this cannot be confirmed with absolute certainty at the moment, it is highly probable that the ESP32 is failing to receive the data properly.

  2. Additionally, I reanalyzed the minicom_newproof.cap file you provided, which shows all the data received by the ESP32 controller.
    I found an anomaly (which is crucial):
    Normally, ACL data begins with 0x02, as per the specification. However, during the issue, the data received by the ESP32 was missing the first 5 bytes. Within the received data, "12 92 00" indicates a write request to handle 0x92, which is the data we need!

image
image

  1. Continuing with debugging:
    (Assuming the host's transmitted data is 100% correct)
    The reason the write request wasn't sent out is that the UART failed to receive the data correctly and dropped part of it. Here are some potential points to investigate:

    3.1 Has the ESP32 entered sleep mode?
    By default, it doesn't unless you've modified the sdkconfig. If you have, could you share your sdkconfig file with us?

    3.2 Is UART flow control enabled on the ESP32?
    Please confirm whether CTS (GPIO 23) and RTS (GPIO 19) are being used. If not, UART issues may arise.

    3.3 Add some UART debugging:
    When the issue occurs (i.e., when the write request fails to send), print the UART registers. Here's a sample code snippet:

    #include "soc/uart_reg.h"
    #include "driver/uart.h"
    uint32_t int_raw_status = REG_READ(UART_INT_RAW_REG(CONFIG_BTDM_CTRL_HCI_UART_NO));
  2. A potential solution:
    The ESP32 controller supports both UART and VHCI methods.
    (Top) → Component config → Bluetooth → Controller Options → HCI mode

    Currently, the esp-idf/examples/bluetooth/hci/controller_hci_uart_esp32 example uses UART directly instead of VHCI, and it doesn't leverage the IDF's UART driver.
    For other BLE and BT examples, we typically use VHCI directly.
    Here are the relevant functions:

    /* VHCI */
    extern bool API_vhci_host_check_send_available(void);
    extern void API_vhci_host_send_packet(uint8_t *data, uint16_t len);
    extern int API_vhci_host_register_callback(const vhci_host_callback_t *callback);

    Currently, the issue lies in the ESP32 controller not receiving data (by uart) from the host correctly.
    Therefore, you can transmit HCI data via the VHCI interface instead of using the controller's UART.
    Specifically:

    • Implement a UART program using the IDF driver to receive data, then send it to the controller via the VHCI interface.
    • Similarly, data from the controller can also be returned through VHCI.

    VHCI has already undergone extensive testing, so I believe it can address your issue.

    This task can be undertaken by either you or us, though it will take some time.
    Additionally, I noticed that someone has already started working on this, but I haven’t tested their implementation yet.
    Here is the code link:
    [ESP-Hosted Network Adapter Bluetooth Slave](https://github.com/espressif/esp-hosted/blob/master/esp_hosted_ng/esp/esp_driver/network_adapter/main/slave_bt.c)

@danergo
Copy link
Author

danergo commented Nov 22, 2024

Thank you for extensive description.

I am more than sure that my long write is correct although I have shared a scope screenshot and not a signal analyser (measured shorter gatt writes, and scope showed shorter data flow).

However, we do have signal analyser as well, will provide you that evidence too.

Will also share our sdkconfig, sleep mode (as far as I remember) is enabled.

Will attach it here.

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 22, 2024

Great! Make sure to print the relevant registers when an issue occurs.
Additionally:

Disable sleep mode.
Enable UART flow control.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 22, 2024

how do you think A potential solution?

@danergo
Copy link
Author

danergo commented Nov 22, 2024

Uart flow control is enabled.

Anyway, for VHCI we have just a slight knowledge.
How shall we attach to the BlueZ? With btattach? Will vhci present a device in /dev?

Thank you. Will get back to you with details later today.

@danergo
Copy link
Author

danergo commented Nov 22, 2024

OMG, my sdkconfig is huge, compared to example's. Please find it attached here.

Many sleep configs are enabled (which might have been modified by us). Will do now 2 things:

1.) Signal-trace the RX pin for the HCI data
2.) Recompile the controller's code with default sdkconfig (or a slightly modified, without sleep).

We have high hopes about this sleep configuration that it can solve this problem :)

Thank you!

sdkconfig.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: In Progress Work is in progress Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

3 participants