WiFi Mesh unstable when parent offline (IDFGH-13875) #14720

Closed
3 tasks done
michaelsimp opened this issue Oct 14, 2024 · 94 comments
Labels: Resolution: NA (Issue resolution is unavailable), Status: In Progress (Work is in progress), Type: Bug (bugs in IDF)

Comments

@michaelsimp

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.3.0

Espressif SoC revision.

Chip is ESP32-S3 (QFN56) (revision v0.2)

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

PowerShell

Development Kit.

ESP32-S3-WROOM-1

Power Supply used.

USB

What is the expected behavior?

I expect the ESP32 to continue running the application without crashing when the WiFi Mesh parent disappears.
If the MESH_ROOT is powered off, I expect a MESH_NODE to assume the role of MESH_ROOT.
If the WiFi router is powered off, I expect the mesh network to re-establish itself when I restore it.

What is the actual behavior?

  1. Sometimes these tests work perfectly. The Mesh network goes down and the nodes start scanning. If I restore the WiFi router, the Mesh network is reestablished
  2. Sometimes the Mesh network goes down, and can't recover. It doesn't crash but it doesn't scan properly and reestablish the Mesh network.
  3. Very regularly, if I power off the WiFi router the MESH_ROOT intermittently crashes, OR if I power off the MESH_ROOT a MESH_NODE intermittently crashes.

Steps to reproduce.

  1. Power on system comprising 2 x ESP32-S3 dev boards and a Wifi router
  2. Connect a serial terminal (I am using PUTTY) to each serial port for monitoring
  3. Let the Mesh network get established and verify MESH_ROOT and MESH_NODE connected.
  4. Power off the WiFi Router
  5. In the example (logs below) the MESH_ROOT crashed

Debug Logs.

I (00:58:03.336) aWifiMesh: <MESH_EVENT_MESH_STARTED>ID:77:77:77:77:77:76
I (136546) mesh: <MESH_NWK_LOOK_FOR_NETWORK>need_scan:0x3, need_scan_router:0x0, look_for_nwk_count:1
I (00:58:03.336) aWifiMesh: This node MAC:48:ca:43:9b:53:d8
I (00:58:03.354) aWifiMesh: WiFi Mesh started successfully, heap:141084, root not fixed
WIN> I (140766) mesh: [S6]VONETS, 00:17:13:20:bd:74, channel:8, rssi:-12
I (140776) mesh: find router:[ssid_len:6]VONETS, rssi:-12, 00:17:13:20:bd:74(encrypted), new channel:8, old channel:0
I (140786) mesh: [FIND][ch:0]AP:11, otherID:0, MAP:1, idle:1, candidate:0, root:0[00:17:13:20:bd:74]router found
I (140796) mesh: [FIND:1]find a network, channel:8, cfg<channel:0, router:VONETS, 00:00:00:00:00:00>

I (00:58:07.590) aWifiMesh: <MESH_EVENT_FIND_NETWORK>new channel:8, router BSSID:00:00:00:00:00:00
W (140796) wifi:<MESH AP>adjust channel:1, secondary channel offset:1(40U)
W (140816) wifi:<MESH AP>adjust channel:8, secondary channel offset:1(40U)
I (141126) mesh: [SCAN][ch:8]AP:1, other(ID:0, RD:0), MAP:0, idle:0, candidate:1, root:0, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (141126) mesh: 1330[SCAN]init rc[48:ca:43:9b:53:d9,-9], mine:0, voter:0
I (141136) mesh: 1368, vote myself, router rssi:-9 > voted rc_rssi:-120
I (141146) mesh: [SCAN:1/10]rc[128][48:ca:43:9b:53:d9,-9], self[48:ca:43:9b:53:d8,-9,reason:0,votes:1,idle][mine:1,voter:1(1.00)percent:1.00][128,1,48:ca:43:9b:53:d9]

I (141456) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:1, candidate:1, root:0, topMAP:0[c:0,i:1][00:17:13:20:bd:74]router found<>
I (141466) mesh: [SCAN:2/10]rc[128][48:ca:43:9b:53:d9,-8], self[48:ca:43:9b:53:d8,-8,reason:0,votes:1,idle][mine:1,voter:2(0.50)percent:1.00][128,1,48:ca:43:9b:53:d9]

I (141776) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (141776) mesh: 7391[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (141796) mesh: [DONE]connect to parent:ESPM_3372B8, channel:8, rssi:-15, 30:30:f9:33:72:b9[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-8/0]
I (141806) mesh: set router bssid:00:17:13:20:bd:74
I (142596) mesh: <MESH_NWK_MIE_CHANGE><><><><ROOT ADDR><><><>
I (142596) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:30:30:f9:33:72:b9, root_cap:1
I (142616) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:30:30:f9:33:72:b9, conflict_roots.num:0<>
I (00:58:09.409) aWifiMesh: <MESH_EVENT_ROOT_ADDRESS>root address:30:30:f9:33:72:b9
I (142616) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (142636) mesh: 2012<arm>parent monitor, my layer:2(cap:6)(node), interval:7286ms, retries:1<normal connected>
I (00:58:09.436) aWifiMesh: <MESH_EVENT_PARENT_CONNECTED>layer:1-->2, parent:30:30:f9:33:72:b9<layer2>, ID:77:77:77:77:77:76
I (00:58:09.451) mesh_netif: It was a wifi station removing stuff
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4212753c  PS      : 0x00060030  A0      : 0x82127613  A1      : 0x3fcc1660
A2      : 0xffffffff  A3      : 0x00000000  A4      : 0xff000000  A5      : 0x00000001
A6      : 0x3fcc0a64  A7      : 0xff000000  A8      : 0x3c1505e4  A9      : 0x00000000
A10     : 0x3fcc0a64  A11     : 0x00000000  A12     : 0x00000101  A13     : 0x3c1505e4
A14     : 0x00000007  A15     : 0x3fcd8024  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0xff00000c  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff


Backtrace: 0x42127539:0x3fcc1660 0x42127610:0x3fcc16b0 0x4037e0aa:0x3fcc16d0
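
To decode a backtrace like the one above, the PC half of each `PC:SP` pair is what an address-to-line tool needs, together with the .elf from the same build. The following is a minimal host-side sketch, assuming the line has the `Backtrace: PC:SP PC:SP ...` shape printed here; the function name `backtrace_pcs` is hypothetical and not part of ESP-IDF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Extract the PC half of each "PC:SP" pair from a backtrace line.
 * Returns the number of PCs written to out (at most max_out).
 * Hypothetical helper for illustration; not an ESP-IDF function. */
size_t backtrace_pcs(const char *line, uint32_t *out, size_t max_out)
{
    assert(line != NULL && out != NULL);
    size_t n = 0;
    const char *p = strstr(line, "0x");    /* skip the "Backtrace:" label */
    while (p != NULL && n < max_out) {
        char *end = NULL;
        unsigned long pc = strtoul(p, &end, 16);
        if (end == p || *end != ':') {
            break;                          /* not a PC:SP pair, stop */
        }
        out[n++] = (uint32_t)pc;
        p = strstr(end, " 0x");             /* next pair starts after a space */
        if (p != NULL) {
            p++;                            /* step over the space */
        }
    }
    return n;
}
```

The extracted addresses can then be passed to `xtensa-esp32s3-elf-addr2line` with the `-e` option pointing at the matching .elf file.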

More Information.

My application integrates a number of IDF example programs including ip_internal_network
I went back to the example project ip_internal_network and built it unmodified, and can reproduce the same problems quite readily.

Also, for when the ESP32 nodes don't completely crash, I would like to know how to restart the Mesh network in software.
I have tried stopping the Mesh network with:
ESP_ERROR_CHECK(esp_mesh_stop());
ESP_ERROR_CHECK(esp_mesh_deinit());
ESP_ERROR_CHECK(mesh_netifs_destroy()); // I have tried with and without this line. Without it, the logs continually report:
I (135746) mesh: mesh is not started
E (00:58:02.547) mesh_netif: Received with err code 16388 ESP_ERR_MESH_NOT_START

I then try to restart the Mesh network with:
/* mesh initialization */
ESP_ERROR_CHECK(esp_mesh_init());
ESP_ERROR_CHECK(esp_mesh_set_max_layer(CONFIG_MESH_MAX_LAYER));
ESP_ERROR_CHECK(esp_mesh_set_vote_percentage(1));
ESP_ERROR_CHECK(esp_mesh_set_ap_assoc_expire(10));
/* set blocking time of esp_mesh_send() to 5 s (was 30 s) so that esp_mesh_send() cannot block permanently */
ESP_ERROR_CHECK(esp_mesh_send_block_time(5000)); // was 30 seconds
mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
cfg.crypto_funcs = NULL;
/* mesh ID */
memcpy((uint8_t *) &cfg.mesh_id, MESH_ID, MAC_SIZE);
/* router */
cfg.channel = CONFIG_MESH_CHANNEL;

cfg.router.ssid_len = strlen(meshProvisionData.ssid);
memcpy((uint8_t *) &cfg.router.ssid, meshProvisionData.ssid, cfg.router.ssid_len);
memcpy((uint8_t *) &cfg.router.password, meshProvisionData.password, strlen(meshProvisionData.password));

ESP_ERROR_CHECK(esp_mesh_set_ap_authmode((wifi_auth_mode_t) CONFIG_MESH_AP_AUTHMODE));
cfg.mesh_ap.max_connection = CONFIG_MESH_AP_CONNECTIONS;
cfg.mesh_ap.nonmesh_max_connection = CONFIG_MESH_NON_MESH_AP_CONNECTIONS;
memcpy((uint8_t *) &cfg.mesh_ap.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
ESP_ERROR_CHECK(esp_mesh_start());

Doing the above while the system is running normally often causes the ESP32s to crash with various errors.
For example, after the start, the MESH_NODE does a scan and then crashes with Guru Meditation Error: Core 0 panic'ed
I (00:22:02.752) aWifiMesh: <MESH_EVENT_FIND_NETWORK>new channel:8, router BSSID:00:00:00:00:00:00
W (1323864) wifi:adjust channel:1, secondary channel offset:1(40U)
W (1323874) wifi:adjust channel:8, secondary channel offset:1(40U)
I (1324184) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (1324184) mesh: 7391[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (1324204) mesh: [DONE]connect to parent:ESPM_3372B8, channel:8, rssi:-14, 30:30:f9:33:72:b9[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-120/0]
I (1324214) mesh: set router bssid:00:17:13:20:bd:74
I (1324834) mesh: <MESH_NWK_MIE_CHANGE><><><><><><>
I (1324834) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:30:30:f9:33:72:b9, root_cap:1
I (1324844) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:30:30:f9:33:72:b9, conflict_roots.num:0<>
I (1324854) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (00:22:03.744) aWifiMesh: <MESH_EVENT_ROOT_ADDRESS>root address:30:30:f9:33:72:b9
I (1324854) mesh: 2012parent monitor, my layer:2(cap:6)(node), interval:4526ms, retries:1
I (00:22:03.771) aWifiMesh: <MESH_EVENT_PARENT_CONNECTED>layer:2-->2, parent:30:30:f9:33:72:b9, ID:77:77:77:77:77:76
I (00:22:03.785) mesh_netif: It was a wifi station removing stuff
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Core 0 register dump:
PC : 0x4212753c PS : 0x00060830 A0 : 0x82127613 A1 : 0x3fcc15d0
A2 : 0xffffffff A3 : 0x00000000 A4 : 0x00000278 A5 : 0x00000001
A6 : 0x3fcc09d0 A7 : 0x00000278 A8 : 0x3c1505e4 A9 : 0x3fcd778c
A10 : 0x3fcc09d0 A11 : 0x00000000 A12 : 0x00000101 A13 : 0x3c1505e4
A14 : 0x00000007 A15 : 0x3fcaa7f4 SAR : 0x00000004 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000284 LBEG : 0x40056f5c LEND : 0x40056f72 LCOUNT : 0xffffffff

Backtrace: 0x42127539:0x3fcc15d0 0x42127610:0x3fcc1620 0x4037e0aa:0x3fcc1640
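
In both register dumps in this post, EXCVADDR sits exactly 0xc above the value in A4 (0xff00000c vs 0xff000000, and 0x00000284 vs 0x00000278). That is consistent with, though not proof of, a load from a field at offset 12 through an invalid pointer. A minimal sketch of that check follows; `find_base_register` is a hypothetical helper, and the 0x100 search window used in the test is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the first register whose value lies at most max_off bytes below the
 * faulting address. Returns its index (0 = A0), or -1 if none matches;
 * on a match, *off receives the distance. Hypothetical helper. */
int find_base_register(const uint32_t *regs, size_t nregs,
                       uint32_t excvaddr, uint32_t max_off, uint32_t *off)
{
    assert(regs != NULL && off != NULL);
    for (size_t i = 0; i < nregs; i++) {
        uint32_t d = excvaddr - regs[i];   /* wraps to a large value if regs[i] > excvaddr */
        if (d <= max_off) {
            *off = d;
            return (int)i;
        }
    }
    return -1;
}
```

Applied to A0..A4 of either dump, the first match is A4 with an offset of 0xc.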

I sometimes get MTX task stack overflows too when I try this, same as #13882

@michaelsimp michaelsimp added the Type: Bug bugs in IDF label Oct 14, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 14, 2024
@github-actions github-actions bot changed the title WiFi Mesh unstable when parent offline WiFi Mesh unstable when parent offline (IDFGH-13875) Oct 14, 2024
@zhangyanjiaoesp
Collaborator

@michaelsimp
I have tested using the ip_internal_network example, but I could not reproduce your problem. Can you provide the .elf file from the build in which the crash happens? Or can you provide the decoded core dump file?

@michaelsimp
Author

michaelsimp commented Oct 24, 2024 via email

@michaelsimp
Author

michaelsimp commented Oct 24, 2024

Hi
I did a quick set of tests today.

I created the ip_internal_network project from examples and configured as follows:
Set IDF version to 5.3.0
Set target device to ESP32-S3 with jtag integrated debugger
Set partition table Factory partition to 0x400000
Set device flash size to 16MB - matches my ESP32-S3
Set the Router SSID to "VONETS" and Password to "pass9999"
Set Panic Handler to "Print registers and halt"

Clean and Build project.
Load into 2 ESP32-S3 with serial terminals connected for monitoring.
One becomes MESH_ROOT and the other connects as MESH_NODE
Took turns at powering off the MESH_ROOT and watching the other become MESH_ROOT and then powering it back on and it connects as a MESH_NODE. This seemed to work ok today.

But what I did find easy to reproduce was:
Power off MESH_ROOT and power back on BEFORE other MESH_NODE becomes MESH_ROOT.
The original MESH_ROOT I power cycled, becomes MESH_ROOT again, but the MESH_NODE remains disconnected.

See attached files:
MESH_ROOT powered off at line 171
MESH_NODE loses connection around line 33 and never recovers

Also see the .elf and .bin files in the attachment. I don't know what or where the "core dump decode file" is, but these tests don't show a CPU crash. I can't run under the debugger because of the power cycle tests.

Please see attachment MESH-Testing.zip two comments down

@michaelsimp
Author

I did some more tests which can easily cause CPU crashes.

Power on both nodes, one becomes MESH_ROOT and one MESH_NODE.
Power off WIFI router.
MESH_ROOT crashes. See file attached MESH_ROOT "Router power off.txt" (crash at end)

Second test. Power up only one node, becomes MESH_ROOT
Power off Router
This time MESH_ROOT does not crash
Power on Router
MESH_ROOT does not crash
Power on a second node which connects to the first MESH_ROOT - see line 492
MESH_ROOT crashes.
See file attached "MESH_ROOT crash on MESH_NODE connect.txt"
MESH_ Testing 2.zip

@michaelsimp
Author

MESH-Testing.zip
This is the attachment for the first tests, 2 entries up. It did not upload properly before.

@brianignacio5
Contributor

Hi @michaelsimp

The esp-idf vscode extension allows you to save settings in multiple places: User (global settings for vscode), Workspace, and Workspace folder (your project's .vscode/settings.json). The ESP-IDF: Show Examples command shows you the esp-idf path used in the current vscode window. You can change where settings are saved with the ESP-IDF: Select where to save configuration settings command. It sounds confusing, but it does allow you to use multiple projects, each with a different esp-idf version (even at the same time, using a vscode workspace). More information in here

It seems the example you are trying to use has some components with specific behavior in each esp-idf version. So building a v5.2.2 example using esp-idf v5.3 might produce some compilation problems. How about creating an example using esp-idf v5.3?

  1. Open a vscode window.
  2. Select esp-idf v5.3 from status bar (recommended) or the ESP-IDF: Configure ESP-IDF extension.
  3. Run the ESP-IDF: Doctor command. Check that esp-idf is indeed using v5.3
  4. Run the ESP-IDF: Show examples. The esp-idf path shown should be v5.3 now
  5. Create your project from esp-idf example and try to build.

We will try to update the Show examples command to show all available esp-idf versions from esp-idf vscode extension to make this easier.

@zhangyanjiaoesp
Collaborator

@michaelsimp The backtrace of the crash issue is here:

xtensa-esp32s3-elf-addr2line -piaf 0x4208c922:0x3fca7e60 0x4201c951:0x3fca7ee0 0x4201f96f:0x3fca7f20 0x4202345e:0x3fca7f50 0x420167bd:0x3fca7f70 -e ip_internal_network.elf

0x4208c922: parse_msg at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:993
 (inlined by) handle_dhcp at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:1190
 (inlined by) handle_dhcp at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:1106
0x4201c951: udp_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/core/udp.c:404
0x4201f96f: ip4_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:746
0x4202345e: ethernet_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/netif/ethernet.c:186
0x420167bd: tcpip_thread_handle_msg at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/api/tcpip.c:174
 (inlined by) tcpip_thread at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/api/tcpip.c:148

I think this issue is caused by a mismatch between your IDF version and the version the example was taken from. Please update your version according to Brian's suggestion and test it again.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Oct 29, 2024
@michaelsimp
Author

I installed IDF version 5.3.1
I had a problem at step 2 of your instructions two entries up, which states: "Select esp-idf v5.3 from status bar (recommended) or the ESP-IDF: Configure ESP-IDF extension."
When I try, it reports, "Open a folder first."
So I opened the folder of a project I made in version IDF 3.0. Then on the status bar I could select version 5.3.1.
When I run the ESP-IDF: Doctor command, I get the following errors:

  • Extension configuration report has been copied to the clipboard with errors.
  • Cannot open file ../report.txt. Detail: Files above 50MB cannot be synchronized with extensions.

I checked my report.txt and found it was over 181MB.
I tried continuing:
When I select Show examples it only shows 5.3.1, which is good.
I created ip_internal_network, but when completed the status bar reports ESP-IDF v5.2.2 again.
So I tried deleting the large report.txt and trying again.
Same problem: it created a report.txt of 181MB again.

@brianignacio5
Contributor

Delete this file:

%USERPROFILE%\.vscode\extensions\espressif.esp-idf-extension-VERSION\esp_idf_vsc_ext.log

and try to run the ESP-IDF: Doctor command again. It seems your extension log has been logging a lot and has exceeded the vscode limit.

About the ESP-IDF v5.2.2 again, it is because the newly created project does not set settings when created. You can select the v5.3.0 from status bar again.

Again sorry for this issue, will work to make it easier to use in the next release of esp-idf extension.

@michaelsimp
Author

I need to make some real progress on this so I have completely uninstalled esp-idf and manually deleted all the ESP and espressif folders including all 3 IDF versions.
I have reinstalled ESP-IDF and only IDF version 5.3.1 to remove all doubt.
I will rebuild and test and report
Thanks

@michaelsimp
Author

Hi again
It's hard to tell, but it seems as if it might be a little more robust, especially with the router power off and on test.
Attachments.zip
But it still crashes, see files attached including my .elf

"Fail 1.txt" is taken from the Mesh_Root
Line 1458 Mesh_Node disconnected
Line 1495 Mesh_Node reconnects
Line 1496 crash

"Fails 2.txt" is taken from the Mesh_Node
Line 1789 Mesh_Root disconnected
Line 1822 Reconnect
Line 1832 Crash divide by zero

@zhangyanjiaoesp
Collaborator

It's weird, I have tested it multiple times as you said (the following two cases) and it can connect normally without any crashing issues.

Power on both nodes, one becomes MESH_ROOT and one MESH_NODE. Power off WIFI router. MESH_ROOT crashes. See file attached MESH_ROOT "Router power off.txt" (crash at end)

Second test. Power up only one node, becomes MESH_ROOT Power off Router This time MESH_ROOT does not crash Power on Router MESH_ROOT does not crash Power on a second node which connects to the first MESH_ROOT - see line 492 MESH_ROOT crashes.

I'm using the Github IDF, and I will try to test with the vscode extension

@michaelsimp
Author

michaelsimp commented Oct 30, 2024 via email

@zhangyanjiaoesp
Collaborator

@michaelsimp

Are you using the completely unmodified example code during your testing?

@michaelsimp
Author

michaelsimp commented Oct 30, 2024

Yes. I did not change anything except:

  • set target device to ESP32-S3 - Internal JTAG debug - What device are you testing with? Could this be a factor?
  • change partition table, factory partition size to 0x400000
  • Using vscode GUI menuconfig:
    • change device flash size to 16MB
    • set WiFi router SSID to VONETS and password to "pass9999"

See my source files attached where you can see most are untouched with the original install date 30/10/24 02:13pm NZ time.
Source.zip

@michaelsimp
Author

FYI I use vscode for coding and building and sometimes JTAG debugging
Most of the time in testing I am monitoring the serial COM ports using PuTTY terminals.

In addition to the crashing, sometimes a MESH_NODE will not reconnect to the MESH network. As a workaround for this, I want to stop and restart the WiFi mesh network. No matter what I try, my nodes intermittently crash on restart. Is this perhaps related?

I am hoping you could please answer a few questions to help me.

What are the recommended steps to stop and then restart the WiFi Mesh network? I currently have :

    ESP_ERROR_CHECK(esp_mesh_stop());
    ESP_ERROR_CHECK(esp_mesh_deinit());

but this causes a lot of error logging which stops if I add ...
ESP_ERROR_CHECK(mesh_netifs_destroy());

My restart is as follow:

void wifiMeshStart() {
    ESP_LOGW(TAG, "Wifi Mesh switch on");
    /*  mesh initialization */
    ESP_ERROR_CHECK(esp_mesh_init());
    ESP_ERROR_CHECK(esp_mesh_set_max_layer(CONFIG_MESH_MAX_LAYER));
    ESP_ERROR_CHECK(esp_mesh_set_vote_percentage(1));
    ESP_ERROR_CHECK(esp_mesh_set_ap_assoc_expire(10));
    /* set blocking time of esp_mesh_send() to 30s, to prevent esp_mesh_send() from blocking permanently for some reason */
    ESP_ERROR_CHECK(esp_mesh_send_block_time(30000));
    mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
#if !MESH_IE_ENCRYPTED
    cfg.crypto_funcs = NULL;
#endif
    /* mesh ID */
    memcpy((uint8_t *) &cfg.mesh_id, MESH_ID, MAC_SIZE);
    /* router */
    cfg.channel = CONFIG_MESH_CHANNEL;

    cfg.router.ssid_len = strlen(meshProvisionData.ssid);
    memcpy((uint8_t *) &cfg.router.ssid, meshProvisionData.ssid, cfg.router.ssid_len);
    memcpy((uint8_t *) &cfg.router.password, meshProvisionData.password, strlen(meshProvisionData.password));
    
    ESP_ERROR_CHECK(esp_mesh_set_ap_authmode((wifi_auth_mode_t) CONFIG_MESH_AP_AUTHMODE));
    cfg.mesh_ap.max_connection = CONFIG_MESH_AP_CONNECTIONS;
    cfg.mesh_ap.nonmesh_max_connection = CONFIG_MESH_NON_MESH_AP_CONNECTIONS;
    memcpy((uint8_t *) &cfg.mesh_ap.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
    ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
    /* mesh start */
    ESP_ERROR_CHECK(esp_mesh_start());
    ESP_LOGI(TAG, "WiFi Mesh started successfully");
}
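
The restart function above copies the SSID and password with lengths taken from `strlen()` and no upper bound. Nothing in this thread ties that to the crashes, but if the provisioning data were ever longer than the destination fields, the `memcpy()` calls would write past them. The following is a minimal host-side sketch of a bounded copy, assuming 32-byte SSID and 64-byte password fields; `router_cfg_t` and `router_cfg_set` are hypothetical stand-ins, not the ESP-IDF types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the router part of the mesh config.
 * Field sizes are assumptions for illustration. */
typedef struct {
    uint8_t ssid[32];
    uint8_t ssid_len;
    uint8_t password[64];
} router_cfg_t;

/* Copy SSID and password only if both fit; *r is left zeroed on failure. */
bool router_cfg_set(router_cfg_t *r, const char *ssid, const char *password)
{
    assert(r != NULL && ssid != NULL && password != NULL);
    memset(r, 0, sizeof(*r));
    size_t ssid_len = strlen(ssid);
    size_t pass_len = strlen(password);
    if (ssid_len == 0 || ssid_len > sizeof(r->ssid) || pass_len > sizeof(r->password)) {
        return false;                  /* reject instead of overrunning the fields */
    }
    memcpy(r->ssid, ssid, ssid_len);
    r->ssid_len = (uint8_t)ssid_len;
    memcpy(r->password, password, pass_len);
    return true;
}
```

A caller would check the return value before handing the configuration to the mesh API, rather than wrapping everything in `ESP_ERROR_CHECK`.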

I notice with my custom application:
I have 1 MESH_ROOT and 5 MESH_NODEs spread around the office. The MESH_NODEs do not seem to choose their parents based on RSSI as documented.

The IDF documentation states "To prevent nodes from forming a weak upstream connection, ESP-WIFI-MESH implements an RSSI threshold mechanism for beacon frames." Is this configurable and if so where? I can't find it in the API or in MenuConfig. What is the default RSSI threshold value?

The IDF documentation states in Preferred Parent Node "The preferred parent node is determined based on the following criteria: Which layer the parent node candidate is situated on. The number of downstream connections (child nodes) the parent node candidate currently has".
Does this mean RSSI is not part of the parent selection process?

Is it recommended to use self-organized networking, or for serious applications should I manually build the mesh network? I will only have a max of 10 mesh nodes altogether but they do a reasonable amount of MQTT5 communications to the cloud.

@zhangyanjiaoesp
Collaborator

@michaelsimp The following are the answers for your questions:

  1. What are the recommended steps to stop and then restart the WiFi Mesh network? I currently have :

        ESP_ERROR_CHECK(esp_mesh_stop());
        ESP_ERROR_CHECK(esp_mesh_deinit());
    

    Calling esp_mesh_stop() is enough.

  2. but this causes a lot of error logging which stops if I add ...
    ESP_ERROR_CHECK(mesh_netifs_destroy());
    Where did you add the mesh_netifs_destroy() call? What does the error log look like? Can you provide an example?

  3. Where did you call the wifiMeshStart() function?

  4. I have 1 MESH_ROOT and 5 MESH_NODEs spread around the office. Mesh_Node seem to connect to parents not based on the RSSI as documented.

    RSSI is not the only criterion for selecting the parent node, the layer and connections also need to be considered.

  5. The IDF documentation states "To prevent nodes from forming a weak upstream connection, ESP-WIFI-MESH implements an RSSI threshold mechanism for beacon frames." Is this configurable and if so where? I can't find it in the API or in MenuConfig. What is the default RSSI threshold value?

    You can call this API:

    esp_err_t esp_mesh_set_rssi_threshold(const mesh_rssi_threshold_t *threshold);

  6. The IDF documentation states in Preferred Parent Node "The preferred parent node is determined based on the following criteria: Which layer the parent node candidate is situated on. The number of downstream connections (child nodes) the parent node candidate currently has". Does this mean RSSI is not part of the parent selection process?

    Same as the fourth point: parent selection needs to consider RSSI, layer, and connections. The documentation needs to be updated.

  7. Is it recommended to use self-organized networking, or for serious applications should I manually build the mesh network? I will only have a max of 10 mesh nodes altogether but they do a reasonable amount of MQTT5 communications to the cloud.

    You can use a self-organized network.
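
Answers 4 and 6 say that RSSI, layer, and connection count all feed into parent selection, but the thread does not state how they are ordered or weighted. The sketch below is therefore only an illustration under an assumed ordering (threshold filter first, then shallowest layer, then fewest children, then strongest RSSI); `parent_candidate_t` and `pick_parent` are hypothetical names, and this is not the ESP-WIFI-MESH implementation. It shows how a layer-first ordering can prefer a distant node at -72 dBm over a nearby one at -35 dBm unless the threshold excludes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate record; field names are illustrative only. */
typedef struct {
    int rssi;          /* dBm as seen by the scanning node */
    int layer;         /* candidate's layer, root = 1 */
    int children;      /* current downstream connections */
    int max_children;  /* capacity of the candidate */
} parent_candidate_t;

/* Return the index of the preferred candidate, or -1 if none qualifies.
 * Assumed ordering: RSSI threshold filter, then shallowest layer,
 * then fewest children, then strongest RSSI. */
int pick_parent(const parent_candidate_t *c, size_t n, int rssi_threshold)
{
    assert(c != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (c[i].rssi < rssi_threshold || c[i].children >= c[i].max_children) {
            continue;                       /* too weak or already full */
        }
        if (best < 0) {
            best = (int)i;
            continue;
        }
        const parent_candidate_t *b = &c[best];
        if (c[i].layer != b->layer) {
            if (c[i].layer < b->layer) {
                best = (int)i;              /* shallower layer wins */
            }
        } else if (c[i].children != b->children) {
            if (c[i].children < b->children) {
                best = (int)i;              /* less loaded parent wins */
            }
        } else if (c[i].rssi > b->rssi) {
            best = (int)i;                  /* final tie-break on signal */
        }
    }
    return best;
}
```

With a threshold of -78 (the first value in the `rssi_threshold` log line earlier in this thread) the far, shallow candidate is chosen; tightening the threshold to -60 leaves only the nearby one.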

@zhangyanjiaoesp
Collaborator

@michaelsimp I can reproduce the crash using the vscode, I will check the difference between VSCode and standard IDF

@michaelsimp
Author

Hi Zhangyanjiaoesp
This is excellent news for me. Hopefully it is just something simple that you will find soon and be able to offer me a fix for.
Thank you

@michaelsimp
Author

Hi Zhangyanjiaoesp

Thanks so much for taking the time to answer all my questions.

Please note THESE tests are with MY application (NOT with example program ip_internal_network) running on a network of 6 nodes - all ESP32-S3. My application has a CLI console integrated so I can trigger actions and see the responses on the COM port.

Q1 I will go back to just calling esp_mesh_stop() and see what happens.
To restart, should I just be able to call ESP_ERROR_CHECK(esp_mesh_start()); ?

Q2 Triggered from the CLI Console I was calling:

    ESP_ERROR_CHECK(esp_mesh_stop());
    ESP_ERROR_CHECK(esp_mesh_deinit());
    ESP_ERROR_CHECK(mesh_netifs_destroy());

Q3 wifiMeshStart() is also called from my CLI console

The CLI Console is started from my Mainline as is my WiFi Mesh application (built on top of the ip_internal_network source).
Triggering the Mesh Stop and Start would be called from the CLI Console thread. I am assuming this is ok and does not need any mutex protection. Please advise how I should call it if this is a problem.
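
Whether stop and start may be called directly from the CLI thread is not settled in this thread. One conservative pattern, sketched here purely as an assumption, is to let the CLI handler only record a request and have a single task own the actual mesh calls. The plain array below stands in for a FreeRTOS queue so the sketch runs on a host; `mesh_ctrl_t`, `mesh_ctrl_request`, and `mesh_ctrl_next` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MESH_CTRL_QUEUE_LEN 4

typedef enum { MESH_CMD_STOP, MESH_CMD_START } mesh_cmd_t;

/* Hypothetical controller state. The array is a stand-in for a FreeRTOS
 * queue, which is what would make the hand-off thread-safe on a device. */
typedef struct {
    bool running;                           /* true once the mesh is started */
    mesh_cmd_t queue[MESH_CTRL_QUEUE_LEN];
    size_t head;
    size_t count;
} mesh_ctrl_t;

/* Called from any context, e.g. a CLI handler: only records the request. */
bool mesh_ctrl_request(mesh_ctrl_t *c, mesh_cmd_t cmd)
{
    assert(c != NULL);
    if (c->count == MESH_CTRL_QUEUE_LEN) {
        return false;                       /* queue full, request dropped */
    }
    c->queue[(c->head + c->count) % MESH_CTRL_QUEUE_LEN] = cmd;
    c->count++;
    return true;
}

/* Called only from the one task that owns the mesh. Pops requests until one
 * would change state, stores it in *out and returns true; the caller then
 * performs esp_mesh_stop() or esp_mesh_start(). Redundant requests are
 * skipped. The state is updated on the assumption that the call succeeds. */
bool mesh_ctrl_next(mesh_ctrl_t *c, mesh_cmd_t *out)
{
    assert(c != NULL && out != NULL);
    while (c->count > 0) {
        mesh_cmd_t cmd = c->queue[c->head];
        c->head = (c->head + 1) % MESH_CTRL_QUEUE_LEN;
        c->count--;
        bool want_running = (cmd == MESH_CMD_START);
        if (want_running != c->running) {
            c->running = want_running;
            *out = cmd;
            return true;
        }
    }
    return false;
}
```

This also filters out a second stop issued while the mesh is already stopped, which is one way to avoid calling into the mesh API in an unexpected state.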

Q4 I understand this

Q5 Thanks

Q6 Thanks for clarifying this, but I am not finding this to be the case. I distributed some MESH_NODEs across the office with the aim of creating a multi-hop network between the far extremes. But it does not form as expected, or at all well for healthy RSSI. I have nodes which are close to my MESH_ROOT or to 2nd layer MESH_NODEs which are not at parent capacity. When my other MESH_NODEs do connect to these parents, the child reports an RSSI to the parent of -35dBm. But they most frequently want to connect to nodes a much longer distance away, getting an RSSI of < -70dBm.

I read somewhere the default RSSI threshold is -120dBm, but I am finding nodes with RSSIs < -70dBm often lose their MQTT connection to the broker. I have an office environment and I have located the nodes approx 10 to 20 meters apart with a max of one wall between, but they are not all line of sight. I very much doubt I could even get a connection at an RSSI less than -100dBm. I am thinking it may be a signal-to-noise ratio issue, so I have scanned the office for WiFi channel usage and selected channel 1 on my WiFi router, as nothing else is using this channel and no other channels overlap. I know this is not an easy question to answer with precision and I appreciate the many influences, but realistically, what is the ballpark minimum RSSI at which I can expect a node to work reliably, and over what sort of distance?

Q7 Because of my Q6 response above, I have started evaluating the example project "manual_networking" to make my MESH_NODES manually scan and select MESH_NODES with the healthiest RSSI. It sounds like you are saying the IDF framework should already be doing this?
So I am now wondering if the vscode crashing issue is also causing this to not work properly and your fix might fix both.
Should I put manual scanning and parent selection changes to one side and wait for the outcome of the vscode crashing?
I guess I would prefer to use the self-organized network as much as possible, if it works as you describe.
Please advise / confirm manual scanning and parent selection should not be required and I might need you to look at the node parent selection for healthy RSSI next.

Best regards

@zhangyanjiaoesp
Collaborator

@michaelsimp

  1. According to the backtrace of the crash, it seems to be related to DHCP, so it should not affect the mesh networking.
  2. I'm sorry I didn't quite understand your question regarding the selection of the parent node. Can you draw a picture to explain it? For example, where are nodes A, B, and C located? What level? How many child nodes are connected below? What is the RSSI of A, B, and C scanning each other? Do you expect A to connect to B but actually connect A to C?

@michaelsimp
Author

See attached:
MeshMaps.zip

"Target Network .png" shows the walls as black lines and nodes as circles. The blue lines are approx what I was expecting to see.

It forms very randomly but with bad choices like the file "Actual example Network.png" with links and dBm in red.

My project is configured for up to 50 nodes and 3 children per node as I wanted to force some layers.

@michaelsimp
Author

"According to the backtrace of the crash issue, it seems related to DHCP, it won't affect the mesh networking."

That may be the case with the crash issue you found, but the root cause of the vscode IDF environment may cause more than one issue. I guess you will know better when you get to the bottom of the vscode IDF environment issue

Do the diagrams I sent help you understand my issue better now?

@zhangyanjiaoesp
Collaborator

Yes, I now understand your question. Once the ROOT node is formed, the chances for other idle nodes to connect to the root node are the same; as long as they can scan the root node within the RSSI threshold range, they can connect and become second-layer nodes. Therefore, it is reasonable for C and D to connect to A and become second-layer nodes. What is the RSSI that B, E, and F receive from A? Since each node can connect to 3 child nodes, if they are within the RSSI threshold range and root A is not yet fully connected, at least one of B, E, or F should be able to connect to A.

I think you can call esp_mesh_set_rssi_threshold() to limit the RSSI threshold for optional parents, which would allow nearby nodes to connect as much as possible. However, nodes D and E are too far from node A. If you set the same RSSI threshold for all nodes, the connection results might still not meet your expectations, unless you configure different RSSI thresholds for each node. Alternatively, you could only call the esp_mesh_set_parent() function to specify the parent of each node.
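
The per-node threshold idea can be kept in one table compiled into every node, with each node looking up its own entry by MAC at startup. This is a minimal sketch under that assumption; `node_threshold_t` and `threshold_for` are hypothetical names, and the threshold values in the usage note are placeholders rather than recommendations. The result would feed the `mesh_rssi_threshold_t` passed to `esp_mesh_set_rssi_threshold()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: one RSSI threshold per node MAC. */
typedef struct {
    uint8_t mac[6];
    int rssi_threshold;   /* dBm */
} node_threshold_t;

/* Return the threshold configured for mac, or fallback if the MAC is not
 * listed in the table. */
int threshold_for(const node_threshold_t *table, size_t n,
                  const uint8_t mac[6], int fallback)
{
    assert(mac != NULL && (table != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        if (memcmp(table[i].mac, mac, 6) == 0) {
            return table[i].rssi_threshold;
        }
    }
    return fallback;
}
```

For example, a table could map 48:ca:43:9b:53:d8 to -60 and 30:30:f9:33:72:b9 to -75, with -78 as the fallback for unlisted nodes.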

Screenshot 2024-10-31 18 04 10

@michaelsimp
Author

michaelsimp commented Nov 13, 2024

Hi Zhangyanjiaoesp

Note 1: This makes sense, thanks. Implemented.

Note 2:
My thinking was if I previously called ESP_ERROR_CHECK(esp_mesh_set_self_organized(false, false)); to scan for a closer parent, I don't have a self organized network, so I need to scan for a parent when the node gets disconnected. Is this not correct?

The second block you pointed out was taken from example project manual_networking inside function mesh_scan_done_handler(). In the else of parent_found (so !parent_found) it calls esp_wifi_scan_stop() and esp_wifi_scan_start() without esp_mesh_set_self_organized(false, false);
I have added esp_mesh_set_self_organized(false, false); before the esp_wifi_scan_stop() as per other instances of this.

Note 3: My thinking was to return the network to self organized again after the scan for better Parent. I will remove it soon (don't want to make too many changes at once) if it is not causing a problem for now. Please confirm.

I am not sure what the overall status of this is from your side, please advise:

  • Could you reproduce the broken mesh problem?
  • Did you find any problems in the library? I saw only a minor change to stack size in esp_task.h.
  • Were there problems in my original main_mesh.c from internal_networking, other than your test code?
  • Is there a specific problem with my code in addition to not calling esp_mesh_set_self_organized(false, false); before a mesh scan?

I did some tests. It is much better at establishing a MESH_ROOT when the MESH_ROOT reboots or is powered off. But I still have a few problems:
14Nov.zip

Test1 root.txt: My Mesh_ROOT crashed line 395 Guru Meditation Error

Test 2 MESH_NODE.txt: See notes at top of file. After Wifi Scan for a new parent, Became a MESH_IDLE with a child node

Why does a node become a MESH_IDLE when there are MESH_ROOT and MESH_NODES very close (around -35 dBm) which are well below capacity? How do I fix them? Another wifi scan request does not fix it as you can see in the log.

Why does a MESH_NODE connect to a MESH_IDLE when there are MESH_ROOT and MESH_NODEs available with strong RSSIs? Surely this should not be allowed. This MESH_NODE was otherwise healthy and so when I rebooted its parent which was MESH_IDLE, it connected to another MESH_NODE successfully.

I started looking through your many changes to the internal_networking mesh_main file and found some changes which I think are important, e.g. in event <MESH_EVENT_PARENT_DISCONNECTED>

image

Would it be possible for you to highlight any other important changes I need to merge into my application?

PS: I also had a node assert and reboot on esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer)
It looks like ESP_ERROR_CHECK is not a good strategy. I don't know which argument could be wrong?
Any advice?

ESP_ERROR_CHECK failed: esp_err_t 0x4008 (ESP_ERR_MESH_ARGUMENT) at 0x4200ff9d
file: "./main/platform/espidf/wifi/wifiMesh.cpp" line 342
func: void findClosestParent(int)
expression: esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer)

abort() was called at PC 0x4037e45f on core 0

Backtrace: 0x40375f06:0x3fcd2db0 0x4037e469:0x3fcd2dd0 0x40386f5d:0x3fcd2df0 0x4037e45f:0x3fcd2e60 0x4200ff9d:0x3fcd2e90 0x420105ea:0x3fcd3150 0x42129bd2:0x3fcd31f0 0x4212a251:0x3fcd3230 0x4212a368:0x3fcd3280 0x4037f062:0x3fcd32a0


ELF file SHA256: ea203e8cd

@michaelsimp
Author

Also the change above adding to event <MESH_EVENT_PARENT_DISCONNECTED>

        if (!esp_mesh_get_self_organized()) {
            printf(">>>%d, set true, true\n",__LINE__);
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }

...undoes my switch to a close parent after manual scan.

@michaelsimp
Author

michaelsimp commented Nov 14, 2024

After more testing and switching back modifications, I have the following summary:

I think you can ignore the earlier crash reports I posted today. I was working on something else and enabled PSRAM, and I think things got a little kludgy and slow. OTA HTTPS started failing part way through. I have disabled PSRAM again and everything seems reliable again. I was not using it yet and had it configured like this, so I didn't think it would have any effect.
image
Anyway will put PSRAM on hold until later in another ticket after more tests on my side.

In event <MESH_EVENT_PARENT_DISCONNECTED> if I have this code, the network establishes itself again reliably and everything looks quite stable. BUT: It immediately undoes my switch to a close parent after manual scan.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

        if (!esp_mesh_get_self_organized()) {
            printf(">>>%d, set true, true\n",__LINE__);
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }
    }
    break;

Whereas if I have this code, the MESH_NODE parent switch works nicely, but the mesh breaks after the MESH_ROOT reboots or powers off.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

        if (disconnected->reason == WIFI_REASON_ASSOC_TOOMANY) {
            esp_mesh_set_self_organized(false, false);
            esp_wifi_scan_stop();
            scan_config.show_hidden = 1;
            scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
            esp_wifi_scan_start(&scan_config, 0);
        }
    }
    break;

@zhangyanjiaoesp
Collaborator

@michaelsimp
The following answers your question:

  1. The patch I provided is the calling method I believe to be correct. I was unable to reproduce your issue locally, as the devices were relatively close to each other during my testing. I cannot deploy my test setup in the same way you have. However, I did test the scenario of whether other nodes can correctly elect a new root after a root power off, and the result was successful.

The second block you pointed out was taken from example project manual_networking inside function mesh_scan_done_handler(). In the else of parent_found (so !parent_found) it calls esp_wifi_scan_stop() and esp_wifi_scan_start() without esp_mesh_set_self_organized(false, false);
I have added esp_mesh_set_self_organized(false, false); before the esp_wifi_scan_stop() as per other instances of this.

In the manual_networking example, the esp_mesh_set_self_organized(false, false) API is called only once throughout the entire process. It ensures that the network remains non self-organized throughout. However, in your project, the esp_mesh_set_self_organized(false, false) , esp_mesh_set_self_organized(true, false), esp_mesh_set_self_organized(true, true) are all called. To ensure that the self-organized network is disabled during the user's scan, it is necessary to call esp_mesh_set_self_organized(false, false). See the doc
image

It looks like ESP_ERROR_CHECK is not a good strategy. I don't know which argument could be wrong?
Any advice?

Unless the return value check is critical, please avoid calling ESP_ERROR_CHECK, as it may cause the program to abort. This is something we discussed previously. Regarding the esp_mesh_set_parent() issue, could you provide how you have set the parameters? I can help check why the parameter error is occurring.
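As an illustration of handling the return value instead of aborting, here is a minimal host-side sketch. The `esp_err_t` typedef and the constants are local stand-ins (0x4008 is the ESP_ERR_MESH_ARGUMENT value printed in the abort log earlier in this thread), and `set_parent_fn`, `try_set_parent()`, and the two stub functions are hypothetical names used only so the example compiles without ESP-IDF.

```c
#include <stdbool.h>
#include <stdio.h>

typedef int esp_err_t;                 /* stand-in for the ESP-IDF type */
#define ESP_OK                0
#define ESP_ERR_MESH_ARGUMENT 0x4008   /* value shown in the abort log */

/* Hypothetical indirection standing in for a call such as esp_mesh_set_parent(). */
typedef esp_err_t (*set_parent_fn)(void *ctx);

/* Stubs that simulate a successful call and an argument error. */
static esp_err_t stub_ok(void *ctx)      { (void)ctx; return ESP_OK; }
static esp_err_t stub_bad_arg(void *ctx) { (void)ctx; return ESP_ERR_MESH_ARGUMENT; }

/*
 * Call fn and report failure to the caller instead of aborting.
 * On error the node simply keeps its current parent and can retry
 * after the next scan.
 */
bool try_set_parent(set_parent_fn fn, void *ctx)
{
    esp_err_t err = fn(ctx);
    if (err != ESP_OK) {
        fprintf(stderr, "set parent failed: 0x%x, keeping current parent\n",
                (unsigned)err);
        return false;
    }
    return true;
}
```

With this shape, an argument error during a parent switch costs one failed attempt rather than a reboot of the node.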

  1. The crash issue is caused by the error check you added when calling the esp_mesh_scan_get_ap_ie_len() function. After the AP is retrieved, it returns -1.
    image

In event <MESH_EVENT_PARENT_DISCONNECTED> if I have this code, the network establishes itself again reliably and everything looks quite stable. BUT: It immediately undoes my switch to a close parent after manual scan.

Yes, I’ve noticed that as well. I think I need to find a better solution.

Note 3: My thinking was to return the network to self organized again after the scan for better Parent. I will remove it soon (don't want to make too many changes at once) if it is not causing a problem for now. Please confirm.

I still think this part of the code is unnecessary. Returning to the self-organized network after finding a better parent doesn’t seem meaningful, especially since you need to periodically scan for nearby APs, and disable the self-organized network during scanning. As I understand it, you only need the self-organized network when selecting the root; at other times, you prefer to actively choose a better parent, correct?

  1. I will check the log you sent and get back to you.

@zhangyanjiaoesp
Collaborator

zhangyanjiaoesp commented Nov 14, 2024

@michaelsimp

I did some tests. It is much better at establishing a MESH_ROOT when the MESH_ROOT reboots or is powered off. But I still have a few problems: 14Nov.zip

Test1 root.txt: My Mesh_ROOT crashed line 395 Guru Meditation Error

I need the ELF file from when the crash occurred to check the backtrace. By the way, had you applied all of the following changes when you were testing?

wifi_lib_s3_1104.zip
0001-fix-dhcp-add-debug-log-for-dhcp-server.zip

Test 2 MESH_NODE.txt: See notes at top of file. After Wifi Scan for a new parent, Became a MESH_IDLE with a child node

The device is idle because it has not yet connected to the network. Its child should return parent idle error.
image

Screenshot from 2024-11-14 16-12-55

@michaelsimp
Author

Hi Zhangyanjiaoesp

I'm really a bit lost now as to where I am and what to do next.

Yes I have applied all the patches you sent me and done a clean build.

I would like to park the Guru Meditation Errors for now as they are infrequent and focus on my main problem which is how to recover after the MESH_ROOT is rebooted.

In event <MESH_EVENT_PARENT_DISCONNECTED> if I have the code below, the network establishes itself again reliably and everything looks quite stable. BUT: It immediately undoes my switch to a close parent after manual scan.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

        if (!esp_mesh_get_self_organized()) {
            printf(">>>%d, set true, true\n",__LINE__);
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }
    }
    break;

Whereas this, the MESH_NODE parent switch works nicely, but the mesh breaks after the MESH_ROOT reboots or powers off.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

//        if (!esp_mesh_get_self_organized()) {
//            printf(">>>%d, set true, true\n",__LINE__);
//            esp_mesh_set_self_organized(true, true); // vote a new root 
//        }
    }
    break;

You acknowledged this replying "Yes, I’ve noticed that as well. I think I need to find a better solution."

My outstanding questions are:

  1. Are you working on the better solution (immediately above)? I don't mean to sound impatient, I just want to ensure we are on the same page.

  2. The internal_communication\main\mesh_main.c you sent me had many changes to compare. Is this just adding the parent switch code, or are there other changes here which I need to apply to my application in addition to the 3 notes you gave me?

  3. Why does a node become a MESH_IDLE when there are MESH_ROOT and MESH_NODES very close (around -35 dBm) which are well below capacity? How do I fix them? Another wifi scan request does not fix it as you can see in the log.

You replied "The device is idle because it has not yet connected to the network. Its child should return parent idle error."

But my problem is it stays this way. Is there something I should do after a node goes MESH_IDLE, or should it fix itself?

  4. Also, when a node becomes MESH_IDLE and it has children (like the example above), why doesn't it drop its children, or why don't the children disconnect themselves and look for a proper connection?

Are these symptoms of the system breaking after the manual scan and set better parent followed by the MESH_ROOT reboot?
Is this something you are still working on?
Or do I have to manage this and if so, how?

@zhangyanjiaoesp
Collaborator

2. The internal_communication\main\mesh_main.c you sent me had many changes to compare. Is this just adding the parent switch code, or are there other changes here which I need to apply to my application in addition to the 3 notes you gave me.

If you had carefully reviewed my changes, you would notice that most of the code has been transplanted from your wifiMesh.cpp file. The three points I raised are the main differences between my code and yours, and they are the changes I suggest you should make.

  1. Are you working on the better solution (immediate above). I don't mean to sound impatient, I just want to ensure we are on the same page.

I am looking into this issue, but I also need to handle other higher-priority tasks, so sometimes I can't respond to you quickly. Additionally, local testing and analyzing logs from multiple devices is quite time-consuming.

Regarding the mesh idle issue, we first need to understand how it occurs in order to find a solution. Could you provide the code you're currently using for testing? I need to confirm what code your current test results are based on.

@michaelsimp
Author

Hi Zhangyanjiaoesp

I did spend time analyzing the changes you made to mesh_main.c with a full compare using winmerge. I just wanted to check with you that I hadn't missed anything important.

I fully understand you have other tasks and cannot respond immediately. The purpose of my question was just to check if you were still planning to help me resolve this.

See my source attached.
wincut2.zip

I have commented out the esp_mesh_set_self_organized(true, false) per your advice.
I have commented out the task which automatically searches for close nodes as I found it better to control this manually from my console using the command "wifi scan"
My event <MESH_EVENT_PARENT_DISCONNECTED> has the following code per your update. This stabilizes the mesh network when the MESH_ROOT reboots, but immediately undoes my manual switch to a closer parent.

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

        if (!esp_mesh_get_self_organized()) {
            ESP_LOGW(TAG, "Reverting to Self Organised");
            printf(">>>%d, set true, true\n",__LINE__);
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }

I really do appreciate your assistance.

espressif-bot pushed a commit that referenced this issue Nov 19, 2024
1. fix(wifi/pm): Fixed the tbtt interval update error when AP's beacon interval changed
   Closes #14720
2. fix(wifi/mesh): Enlarge the mesh TX task stack
3. fix(wifi/espnow): Added check for espnow type and length on v1.0
4. fix(wifi/mesh): Fixed delete group id error in wifi mesh
   Closes #14735
@zhangyanjiaoesp
Collaborator

zhangyanjiaoesp commented Nov 19, 2024

@michaelsimp

  1. Are you currently testing on v5.3.0? I recommend updating to v5.3.1 for testing, as v5.3.1 fixes the issue of the infinite loop for the [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:1231, xseqno:579, no_wnd_count:0 log.

  2. Please use this change and test again.
    image

  3. Did you disable the Wi-Fi logs during your testing? I didn’t see any log entries like the one below in your logs. Please enable the Wi-Fi logs.

I (18854) wifi:new:<5,1>, old:<5,1>, ap:<5,1>, sta:<5,1>, prof:5, snd_ch_cfg:0x0
I (18854) wifi:state: init -> auth (0xb0)
I (18874) wifi:state: auth -> assoc (0x0)
I (18884) wifi:state: assoc -> run (0x10)

@zhangyanjiaoesp
Collaborator

@michaelsimp
I would like to discuss the rules for selecting a better parent with you. When choosing a better parent, do you only consider the RSSI value, or do you also take into account layer, assoc, and RSSI? If it’s the latter, which factor do you prioritize the most?

During my testing, I observed the following phenomenon:
Node A initially connects to the root node, becoming a layer2 node. Therefore, its parent’s layer should be 1, i.e., parent_assoc.layer = 1. However, parent_assoc is defined in the findClosestParent() function and is initially set to 6.

void findClosestParent(int num) { // after a WiFi scan
    ESP_LOGW(TAG, "findClosestParent  Current RSSI: %d", currentRSSI);
    int i;
    int ie_len = 0;
    mesh_assoc_t assoc;
    mesh_assoc_t parent_assoc = { .layer = CONFIG_MESH_MAX_LAYER, .rssi = -120 };
    wifi_ap_record_t record;
    wifi_ap_record_t parent_record = { 0, };
    parent_record.rssi = currentRSSI; // has to be better than current RSSI to change parent

If Node A scans an AP with a layer < 6 and an RSSI stronger than its current parent’s RSSI, it will adopt the new AP as a better parent. In my case, Node A found Node B, a layer2 node, as a better parent, and became a layer3 node.

I (18614) mesh_main: <MESH_EVENT_SCAN_DONE>number:2
W (18614) aWifiMesh: findClosestParent  Current RSSI: -6
I (18624) aWifiMesh: <MESH>[0]ESPM_A5B180, layer:2/4, assoc:0/2, 1, 34:85:18:a5:b1:81, channel:5, rssi:0, ID<77:77:77:77:77:76><IE Unencrypted>
>>>layer: 2,6, layer2_cap:1/0
W (18634) aWifiMesh: Closer Parent found: ESPM_A5B180  RSSI: 0
I (18644) aWifiMesh: <MESH>[1]ESPM_E0F6C0, layer:1/5, assoc:2/2, 0, 7c:df:a1:e0:f6:c1, channel:5, rssi:-6, ID<77:77:77:77:77:76><IE Unencrypted>
W (18654) aWifiMesh: <PARENT>ESPM_A5B180, layer:2/4, assoc:0/2, 1, 34:85:18:a5:b1:81, channel:5, rssi:0
I (18664) mesh: [IO]disable self-organizing<reconnect>
I (18674) wifi:state: run -> init (0x0)
I (18684) wifi:pm stop, total sleep time: 0 us / 12567464 us

I (18684) wifi:<ba-del>idx:0, tid:5
I (18684) wifi:new:<5,0>, old:<5,1>, ap:<5,1>, sta:<5,1>, prof:5, snd_ch_cfg:0x0
W (18704) wifi:<MESH AP>adjust channel:5, secondary channel offset:1(40U)
I (18704) wifi:Total power save buffer number: 16
W (18714) wifi:Password length matches WPA2 standards, authmode threshold changes from OPEN to WPA2
I (18754) mesh: [MANUAL]connect to parent:ESPM_A5B180, 34:85:18:a5:b1:81[layer:2], ID:77:77:77:77:77:76<>
I (18934) mesh_main: <MESH_EVENT_PARENT_CONNECTED>layer:2-->3, parent:34:85:18:a5:b1:81, ID:77:77:77:77:77:76

However, this actually resulted in a worse network condition for Node A, because the network path became longer by transitioning from a layer2 node to a layer3 node.

If I move the definition of parent_assoc outside the findClosestParent() function and assign it a value in the MESH_EVENT_PARENT_CONNECTED event, it would solve the issue described above.

    case MESH_EVENT_PARENT_CONNECTED: {
        mesh_event_connected_t *connected = (mesh_event_connected_t *)event_data;
        esp_mesh_get_id(&id);
        mesh_layer = connected->self_layer;
        memcpy(&mesh_parent_addr.addr, connected->connected.bssid, 6);
        parent_assoc.layer = mesh_layer - 1;

However, it would introduce a new problem: a layer 2 node would never switch its parent because its parent is the root node (layer = 1), and no other node has a layer smaller than 1.
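One way to weigh layer, assoc, and RSSI together is sketched below as plain host-side C. This is only an illustration of the trade-off discussed above, not code from either project: `parent_info_t`, `is_better_parent()`, and both numeric limits are hypothetical and would need tuning on real hardware.

```c
#include <stdbool.h>

/* Hypothetical summary of a parent (current or candidate). */
typedef struct {
    int layer;      /* layer of the parent itself */
    int assoc;      /* children currently connected */
    int assoc_cap;  /* maximum children allowed */
    int rssi;       /* dBm as seen by the node doing the selection */
} parent_info_t;

#define RSSI_MARGIN_DB 6      /* assumed hysteresis to avoid flapping */
#define POOR_RSSI_DBM  (-75)  /* assumed limit for a "poor" current link */

/*
 * Accept a candidate only if it has a free slot and is clearly stronger.
 * A candidate in the same or a shallower layer is always acceptable then.
 * A deeper candidate (longer path to the root) is accepted only when the
 * current link is poor, so a healthy layer 2 node is not pushed to layer 3.
 */
bool is_better_parent(const parent_info_t *cur, const parent_info_t *cand)
{
    if (cand->assoc >= cand->assoc_cap) {
        return false;
    }
    if (cand->rssi < cur->rssi + RSSI_MARGIN_DB) {
        return false;
    }
    if (cand->layer <= cur->layer) {
        return true;
    }
    return cur->rssi < POOR_RSSI_DBM;
}
```

With the values from the log above (root at layer 1 with RSSI -6, node B at layer 2 with RSSI 0), this rule rejects the switch, while a layer 2 node whose link to the root has dropped below the poor-link limit can still move to a closer layer 2 parent.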

@michaelsimp
Author

michaelsimp commented Nov 20, 2024

Hi

In response to your 2nd to last post.

Q 1. Yes, I am using 5.3.1. I completely deleted the earlier SDKs 5.3.0 and 5.2 a few weeks ago to be 100% sure.

Q 2. I have tried this but no success.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        connectionStatusLed(Provisioned); // update LED flash pattern
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);
    
        if (disconnected->reason == WIFI_REASON_BEACON_TIMEOUT) {
            printf(">>>%d, set true, true\n",__LINE__);
            ESP_LOGW(TAG, "Reverting to Self Organised");
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }
    }
    break;

I experimented with the disconnect->reason codes. These are named but not described, so it's hard to know what they mean and how to apply them. Some of them are not even defined, e.g. 100 and 101.

On Node reboot during shutdown I get disconnect->reason codes:
8 = WIFI_REASON_ASSOC_LEAVE

On Node startup I get disconnect->reason codes:
101 = not defined
100 = not defined

I found when the MESH_ROOT is lost I get disconnect->reason codes:
209 = WIFI_REASON_SA_QUERY_TIMEOUT
101 = not defined
then it scans
and after scan sometimes I get
2 = WIFI_REASON_AUTH_EXPIRE
105 = not defined

When I manually switch to a closer parent I get disconnect->reason codes:
8 = WIFI_REASON_ASSOC_LEAVE
201 = WIFI_REASON_NO_AP_FOUND
sometimes
206 = WIFI_REASON_AP_TSF_RESET

On event <MESH_EVENT_PARENT_DISCONNECTED> I thought I could test if not codes 8, 201 and 206 then esp_mesh_set_self_organized(true, true); but the reason codes are inconsistent

        // if (disconnected->reason == WIFI_REASON_BEACON_TIMEOUT) {
        //     printf(">>>%d, set true, true\n",__LINE__);
        //     ESP_LOGW(TAG, "Reverting to Self Organised");
        //     esp_mesh_set_self_organized(true, true); // vote a new root 
        // }
        if ((disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)) {
            printf(">>>%d, set true, true\n",__LINE__);
            ESP_LOGW(TAG, "Reverting to Self Organised");
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }

I don't feel comfortable with this solution unless it is endorsed by you guys, but anyway after several successful cycles, it failed again with lots of:

W (210492) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:589, xseqno:232, no_wnd_count:0, timeout_count:23
I (210992) mesh: [SCAN][ch:1]AP:5, other(ID:0, RD:0), MAP:3, idle:1, candidate:0, root:0, topMAP:0[c:2,i:2][00:00:00:00:00:00]<>
I (210992) mesh: [FAIL][53]root:0, fail:53, normal:0, <pre>backoff:0

ending in a broken mesh again, see logs attached test 4 below.

Q3. Yes I had changed log_levels at the start of my project. I have removed this now so everything starts at default levels. You can set the levels in my console using "loglevel xxx l" where xxx is the ESP_LOG tag and l = the level.

See logs attached
Nov20.zip

Wifi manual scan switch shows MESH_NODE start up and connect, followed by a manual scan and switch to a closer parent. (No reboot of MESH_ROOT)

Test 1 is a reference with no manual scan for closer parent before reboot of MESH_ROOT and all nodes reconnect successfully

Test 2 is a number of cycles of manual scan and parent switch followed by MESH_ROOT reboot. This was successful for about 3 cycles, although it seemed to struggle, before failing on the last attempt. The logs have wrapped but show the end, when it gets broken.

Test 3 is a more controlled failure:
Test3 Mesh_root showing power off after nodes connect and manual scan for better root
Test3 Node becomes root showing, MESH_NODE, Connects to root, Does manual scan and switch parent, Root powers off, This node becomes MESH_ROOT
Test 3 node becomes MESH_IDLE showing, MESH_NODE starts and connects to root, Power off root, MESH_NODE doesn't scan instead become MESH_IDLE
Test 3 node stays connected to parent which becomes MESH_IDLE, showing MESH_NODE connects to MESH_ROOT, Does manual scan and switch parent, Root powers off, This node doesn't scan, stays MESH_NODE connected same parent which has now become MESH_IDLE

@michaelsimp
Author

Hi

In response to your last post:

At present I only consider the RSSI value.
The test in findClosestParent() does consider nodes already at capacity number of children, and will not try to switch to them.

My thoughts were, I am not wanting to build the mesh network from scratch as I start with a self configured network. I am only planning to make changes to nodes with poor RSSIs. So far my tests have been successful network architecture wise (when I have a fixed ROOT so I don't get the broken mesh problem).

I appreciate what you are saying and will certainly do more testing and add more intelligence into the parent selection if necessary. I already send all my node network attributes to the MESH_ROOT, where I have a table of all nodes and their parent, children, layers and RSSI. I could broadcast this to all nodes if necessary to enable smarter logic at the selection.

But I can't keep the manual scan and parent switch code while it causes my network to break which leaves me in a real predicament performance wise.

I really need a resolution to this as my priority.

Thanks for your ongoing help

@zhangyanjiaoesp
Collaborator

The definition of reason code 100/101 is here:

} mesh_disconnect_reason_t;

image
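For logging, the numeric codes seen in this thread can be mapped to names with a small lookup table. This is a host-side sketch: `reason_name_t` and `reason_to_name()` are hypothetical helpers. The WIFI_REASON_* pairs are the ones listed in the comments above plus WIFI_REASON_BEACON_TIMEOUT (200). The MESH_REASON_* names for 100, 101, and 105 are an assumption based on the mesh_disconnect_reason_t enum in esp_mesh.h and should be checked against the header in your IDF version.

```c
#include <stddef.h>

/* Hypothetical code-to-name pair for log output. */
typedef struct {
    int code;
    const char *name;
} reason_name_t;

static const reason_name_t k_reason_names[] = {
    {   2, "WIFI_REASON_AUTH_EXPIRE" },
    {   8, "WIFI_REASON_ASSOC_LEAVE" },
    { 100, "MESH_REASON_CYCLIC" },          /* assumed, verify in esp_mesh.h */
    { 101, "MESH_REASON_PARENT_IDLE" },     /* assumed, verify in esp_mesh.h */
    { 105, "MESH_REASON_PARENT_STOPPED" },  /* assumed, verify in esp_mesh.h */
    { 200, "WIFI_REASON_BEACON_TIMEOUT" },
    { 201, "WIFI_REASON_NO_AP_FOUND" },
    { 206, "WIFI_REASON_AP_TSF_RESET" },
    { 209, "WIFI_REASON_SA_QUERY_TIMEOUT" },
};

/* Return a printable name for a disconnect reason, or "UNKNOWN". */
const char *reason_to_name(int code)
{
    size_t n = sizeof(k_reason_names) / sizeof(k_reason_names[0]);
    for (size_t i = 0; i < n; i++) {
        if (k_reason_names[i].code == code) {
            return k_reason_names[i].name;
        }
    }
    return "UNKNOWN";
}
```

Printing the name next to the number in the MESH_EVENT_PARENT_DISCONNECTED log line makes it easier to see at a glance whether a disconnect came from the Wi-Fi layer or from the mesh layer.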

@zhangyanjiaoesp
Collaborator

I don't feel comfortable with this solution unless it is endorsed by you guys, but anyway after several successful cycles, it failed again with lots of:

You can use the reason code to categorize the issues, but this is not entirely reliable, as different scenarios may generate the same reason code.

I have tried this but no success.

Are you saying that using disconnected->reason == WIFI_REASON_BEACON_TIMEOUT for judgment is completely ineffective, or it can work but can't work as well as (disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)?

@zhangyanjiaoesp
Collaborator

  1. Do the test 2 logs 1/2/3/4 refer to the four devices in a single round of testing?
  2. Where is test 3?
    image

@michaelsimp
Author

Yes Test 2 1.txt through test 2 4.txt were the 4 devices on a single round of testing
Sorry here are the missing test logs from yesterday
Nov22.zip
Tests 1, 2, 3 were done on the software with:
disconnected->reason == WIFI_REASON_BEACON_TIMEOUT

Test 4 was done with:
(disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)

Are you saying that using disconnected->reason == WIFI_REASON_BEACON_TIMEOUT for judgment is completely ineffective, or it can work but can't work as well as (disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)

Neither works reliably. After 2 or 3 cycles, some nodes fail, go MESH_IDLE and do not scan, and the network is broken.

@zhangyanjiaoesp
Collaborator

I just reviewed the log for test3, and the device behavior is normal. The device being in the MESH_IDLE state is not permanent; it is a temporary state. Below is my analysis:

  1. At the beginning, the self-organizing network formed the following topology:
    root(53:d8) --- node A (39:d4)
                |--- node B (72:b8)
                |--- node C (5c:68)
  2. Node C runs a manual scan, selects A as the better parent, and becomes a layer 3 node (self-organized disabled, parent set).
  3. Node A runs a manual scan, still selects the root as the better parent, and remains a layer 2 node (self-organized disabled, parent set).
  4. The root is powered off.
  5. Node B detects that the root has left (beacon timeout, parent disconnected), enables self-organized mode, and becomes the root.
  6. Node A detects that the root has left (beacon timeout, parent disconnected) and enables self-organized mode. However, it was sending data at that moment and was still trying to reconnect, so when you queried it, the device was shown in the MESH_IDLE state. If the device remains idle and cannot recover, then that is an issue. But if there are no subsequent logs, I don't consider it a problem. You cannot expect the device to be in a non-idle state every time the application layer checks the mesh status.

@zhangyanjiaoesp
Collaborator

In the test4 log, the device eventually connected successfully.
image

The log you referred to is just a part of the intermediate process.
image

@zhangyanjiaoesp
Collaborator

According to your test log, I think disconnected->reason == WIFI_REASON_BEACON_TIMEOUT will work better than (disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET), because there are too many possible disconnect reasons, and it is unreasonable to switch to self-organized mode whenever the reason is anything other than 8, 201, or 206.
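The difference between the two checks can be made concrete with a host-side sketch. The enum values are the numeric reason codes quoted in this thread (200 is the value of WIFI_REASON_BEACON_TIMEOUT); the enum and function names are hypothetical and exist only so the example compiles without ESP-IDF.

```c
#include <stdbool.h>

/* Local copies of the numeric reason codes discussed in this thread. */
enum {
    REASON_AUTH_EXPIRE    = 2,
    REASON_ASSOC_LEAVE    = 8,
    REASON_BEACON_TIMEOUT = 200,
    REASON_NO_AP_FOUND    = 201,
    REASON_AP_TSF_RESET   = 206
};

/* Option 1: revert to self-organized mode only when the parent's beacons stop. */
bool revert_on_beacon_timeout(int reason)
{
    return reason == REASON_BEACON_TIMEOUT;
}

/* Option 2: revert on every reason except the three seen during a manual switch. */
bool revert_unless_manual_switch(int reason)
{
    return reason != REASON_ASSOC_LEAVE &&
           reason != REASON_NO_AP_FOUND &&
           reason != REASON_AP_TSF_RESET;
}
```

Option 2 also fires on codes such as 2 (WIFI_REASON_AUTH_EXPIRE) and on any code that is not listed at all, which is why it reverts to self-organized mode far more often than option 1.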

@michaelsimp
Author

How long should it take for a MESH_IDLE node to find a parent again? I am sure I waited tens of seconds and it wasn't even scanning.
I am setting up for another run of tests where I will wait longer. I am just worried about my trace overflowing in PuTTY.

@zhangyanjiaoesp
Collaborator

In the test2 logs, I cannot analyze the entire network change process as I did with test3 because the log only contains part of the information. It is curious why such a reason would occur.
image

It seems that in test2_2 and test2_3, there was no opportunity to switch to the self-organized network, and the device kept trying to connect to the originally configured parent, but the parent could not be detected.

@michaelsimp
Author

Hi

Today I am getting problems where nodes get stuck in a loop logging forever. I can't keep my trace open long enough as I lose the start, but take my word for it please, once in this state it never comes out no matter how long (minutes). eg

I (00:02:07.516) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:139
W (00:02:07.528) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
W (00:02:07.529) aWifiMesh: <MESH_EVENT_ROUTING_TABLE_REMOVE>remove 1, new:3
I (128652) mesh: [wifi]disconnected reason:201(), continuous:1/max:12, non-root, vote(,stopped)<><>
I (00:02:07.644) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:07.645) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (128772) mesh: [wifi]disconnected reason:201(), continuous:2/max:12, non-root, vote(,stopped)<><>
I (00:02:07.769) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:07.769) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (128882) mesh: 1145[xrsp:1]the asked:19, max window:2, force to increase/decrease(up) xseqno:17 for child 48:ca:43:9b:5d:20, xrsp_seqno:14, heap:101160
I (128892) mesh: 1307[recv]cidx[0]48:ca:43:9b:5d:20 xseqno loss, current/new:15/19, in:17, out:17, pending:0
I (128892) mesh: [wifi]disconnected reason:201(), continuous:3/max:12, non-root, vote(,stopped)<><>
I (00:02:07.893) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:07.894) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (129022) mesh: [wifi]disconnected reason:201(), continuous:4/max:12, non-root, vote(,stopped)<><>
I (00:02:08.018) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:08.019) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (129142) mesh: [wifi]disconnected reason:201(), continuous:5/max:12, non-root, vote(,stopped)<><>
I (00:02:08.143) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:08.144) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (129272) mesh: [wifi]disconnected reason:201(), continuous:6/max:12, non-root, vote(,stopped)<><>
I (00:02:08.268) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:08.269) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1

Nov 25.zip

I found that I am the cause of one way this can happen: when I scan and switch parents manually. See the test1 logs attached, where
node A MAC 48:27:e2:18:39:80 switches to node B 48:ca:43:9b:5d:20
node B MAC 48:ca:43:9b:5d:20 switches to node A 48:27:e2:18:39:80

This is one cause of the above. I think I can fix this by checking that I am not switching the node's parent to one of its children.
I think it probably also makes sense not to swap to a parent node which has a higher layer than this node.
But while it is not ideal that I am doing this, it shouldn't result in the node getting stuck in a disconnect loop?
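
A minimal sketch of the "not one of its children" check, assuming the node already has a list of its descendants' MAC addresses (in ESP-IDF that list would typically come from esp_mesh_get_routing_table()). The mac_t type and is_descendant() are hypothetical names used only for illustration. One caveat that this sketch does not handle: the BSSID reported by a scan may be the candidate's softAP address, which can differ from its station address, so a conversion may be needed before comparing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical stand-in for a 6-byte MAC address entry. */
typedef struct {
    uint8_t addr[MAC_LEN];
} mac_t;

/* Returns true when `candidate` matches any entry in `descendants`.
 * A candidate that matches should not be chosen as the new parent. */
static bool is_descendant(const mac_t *descendants, size_t count,
                          const mac_t *candidate)
{
    for (size_t i = 0; i < count; i++) {
        if (memcmp(descendants[i].addr, candidate->addr, MAC_LEN) == 0) {
            return true;
        }
    }
    return false;
}
```

In the scan loop, a candidate for which is_descendant() returns true would simply be skipped.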

But Test 2 looks like the same problem, and it is not triggered by the above. MESH_ROOT is powered off. A panic crash occurs on MESH_NODE 48:ca:43:9b:5d:20, which still happens from time to time, but my bigger concern is that after this, MESH_NODE MAC: 48:27:e2:18:39:80 gets stuck in the disconnect loop.

Are you able to reproduce these problems with the ip_internal_network you modified a week or so back? I get lost in all of this and feel we would make better progress if you were able to test, analyze and debug directly.

@michaelsimp
Author

Hi again
Regarding your analysis of test 3, specifically node A, where you said:
"Node A detected that the root had left (beacon timeout, parent disconnect) and enabled self-organized mode. At that moment it was sending data and trying to reconnect, so when you queried it, the device was shown in the MESH_IDLE state. I believe that if the device remains in an idle state and cannot recover, then this is an issue. However, if there are no subsequent logs, I don't consider it a problem. You cannot expect the device to always be in a non-idle state whenever the application layer checks the mesh status."
The disconnect came at 00:01:18:
I (00:01:18.371) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:200
I stopped the log after 00:01:58, 40 seconds later, and nothing was happening: there were no visible signs of scanning for a parent.
I am sure that when this occurs, no matter how long I leave it, it does not recover.
Also, when it gets stuck in the disconnect loop, it logs the repetitive sequence (above) indefinitely.

@michaelsimp
Author

michaelsimp commented Nov 25, 2024

Two posts back I wrote:

I found that I am the cause of one way this can happen: when I scan and switch parents manually. See the test1 logs attached, where
node A MAC 48:27:e2:18:39:80 switches to node B 48:ca:43:9b:5d:20
node B MAC 48:ca:43:9b:5d:20 switches to node A 48:27:e2:18:39:80

This is one cause of the above. I think I can fix this by checking that I am not switching the node's parent to one of its children.
I think it probably also makes sense not to swap to a parent node which has a higher layer than this node.
But while it is not ideal that I am doing this, it shouldn't result in the node getting stuck in a disconnect loop?

When I looked at the code, I found it difficult to decipher the variables:

  • parent_record is the best parent candidate found so far
  • assoc, I think, is each node found in the scan by esp_mesh_scan_get_ap_record(&record, &assoc); is this correct?
  • parent_assoc: I am not clear on what this is or how it gets set.
    It is initialized to mesh_assoc_t parent_assoc = { .layer = CONFIG_MESH_MAX_LAYER, .rssi = -120 }; as a worst-case record
    and updated to the contents of assoc when a better parent is found

The original source (taken from the example project "manual_networking") already seems to be checking:
if (assoc.layer < parent_assoc.layer || assoc.layer2_cap < parent_assoc.layer2_cap) {
But I am not sure whether this stops a MESH_NODE from selecting a child as a new parent, or whether I need to add the line:
if (esp_mesh_get_layer() >= assoc.layer)
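
As a sketch of what a layer check against this node's own layer could look like (this is not the example project's code): as far as the routine posted below shows, parent_assoc holds the best candidate found so far, so the existing comparison ranks candidates against each other and not against the layer this node is on. The candidate_t type below is a hypothetical stand-in for the few mesh_assoc_t fields the routine uses, my_layer is assumed to come from esp_mesh_get_layer(), and the strict comparison is one possible choice that also excludes peers on the same layer.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the scan fields used by the routine. */
typedef struct {
    int layer;      /* layer the candidate currently sits on */
    int layer_cap;  /* nonzero when the candidate can still take a deeper layer */
    int assoc;      /* children currently associated with the candidate */
    int assoc_cap;  /* maximum children the candidate accepts */
    int rssi;       /* signal strength seen in the scan */
} candidate_t;

/* Returns true when the candidate has room, sits on a layer above this
 * node (smaller layer number), and has a better RSSI than the current link.
 * A my_layer value <= 0 means "not connected", so the layer test is skipped. */
static bool is_eligible_parent(const candidate_t *c, int my_layer, int current_rssi)
{
    if (c->layer_cap == 0 || c->assoc >= c->assoc_cap) {
        return false;               /* no capacity left */
    }
    if (my_layer > 0 && c->layer >= my_layer) {
        return false;               /* same layer or deeper: may be a descendant */
    }
    return c->rssi > current_rssi;  /* must improve on the current link */
}
```

A node on layer 3 would therefore accept a stronger candidate on layer 2 and reject any candidate on layer 3 or 4, which covers its own children without needing their MAC addresses.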

Could you take a look at the test 1 logs, as they do appear to show the two nodes setting each other as parent?

The entire routine is currently as follows, if you could check it and make any changes please.

void findClosestParent(int num) { // after a WiFi scan
    ESP_LOGW(TAG, "findClosestParent  Current RSSI: %d", currentRSSI);
    int i;
    int ie_len = 0;
    mesh_assoc_t assoc;
    mesh_assoc_t parent_assoc = { .layer = CONFIG_MESH_MAX_LAYER, .rssi = -120 };
    wifi_ap_record_t record;
    wifi_ap_record_t parent_record = { 0, };
    parent_record.rssi = currentRSSI; // has to be better than current RSSI to change parent
    bool parent_found = false;
    mesh_type_t my_type = MESH_IDLE;
    int my_layer = -1;
    wifi_config_t parent = { 0, };
    wifi_scan_config_t scan_config = { 0 };

    for (i = 0; i < num; i++) { // iterate through scan records looking for eligible closer parent node
        ESP_ERROR_CHECK(esp_mesh_scan_get_ap_ie_len(&ie_len));
        ESP_ERROR_CHECK(esp_mesh_scan_get_ap_record(&record, &assoc));
        ESP_LOGD(TAG, "ie_len: %d  sizeof(assoc): %d", ie_len, sizeof(assoc));
        if (ie_len == sizeof(assoc)) {
            ESP_LOGI(TAG,
                     "<MESH>[%d]%s, layer:%d/%d, assoc:%d/%d, %d, "MACSTR", channel:%u, rssi:%d, ID<"MACSTR"><%s>",
                     i, record.ssid, assoc.layer, assoc.layer_cap, assoc.assoc, assoc.assoc_cap, assoc.layer2_cap, MAC2STR(record.bssid),
                     record.primary, record.rssi, MAC2STR(assoc.mesh_id), assoc.encrypted ? "IE Encrypted" : "IE Unencrypted");

            // ESP_LOGI(MESH_TAG, "Type: %d  layer_cap %d:  assoc %d  assoc_cap: %d  rssi: %d", assoc.mesh_type, assoc.layer_cap, assoc.assoc, assoc.assoc_cap, record.rssi);
            if (assoc.mesh_type != MESH_IDLE && assoc.layer_cap && assoc.assoc < assoc.assoc_cap) { 
                // ESP_LOGI(MESH_TAG, "assoc.layer: %d  parent_assoc.layer %d:  assoc.layer2_cap %d  parent_assoc.layer2_cap: %d", assoc.layer, parent_assoc.layer, assoc.layer2_cap, parent_assoc.layer2_cap);
                if (assoc.layer < parent_assoc.layer || assoc.layer2_cap < parent_assoc.layer2_cap) {
                    if (record.rssi > parent_record.rssi) { // closer parent found
                        if (memcmp(parent_record.bssid, record.bssid, MAC_SIZE) != 0) { // dont switch to same parent
                            ESP_LOGW(TAG, "Closer Parent found: %s  RSSI: %d", record.ssid, record.rssi);
                            parent_found = true;
                            memcpy(&parent_record, &record, sizeof(record));
                            memcpy(&parent_assoc, &assoc, sizeof(assoc));
                            if (parent_assoc.layer_cap != 1) {
                                my_type = MESH_NODE;
                            } else {
                                my_type = MESH_LEAF;
                            }
                            my_layer = parent_assoc.layer + 1;
                            // break; // MSB removed, keep searching for the closest parent
                        }
                    }
                }
            }
        } else {
            ESP_LOGD(TAG, "[%d]%s, "MACSTR", channel:%u, rssi:%d", i, record.ssid, MAC2STR(record.bssid), record.primary, record.rssi);
        }
    }

    esp_mesh_flush_scan_result();
    if (parent_found) { // parent: Both channel and SSID of the parent are mandatory
        parent.sta.channel = parent_record.primary;
        memcpy(&parent.sta.ssid, &parent_record.ssid, sizeof(parent_record.ssid));
        parent.sta.bssid_set = 1;
        memcpy(&parent.sta.bssid, parent_record.bssid, 6);
        if ((my_type == MESH_NODE) || (my_type == MESH_LEAF) || (my_type == MESH_IDLE)) {
            ESP_ERROR_CHECK(esp_mesh_set_ap_authmode(parent_record.authmode));
            if (parent_record.authmode != WIFI_AUTH_OPEN) {
                memcpy(&parent.sta.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
            }
            ESP_LOGW(TAG,
                     "<PARENT>%s, layer:%d/%d, assoc:%d/%d, %d, "MACSTR", channel:%u, rssi:%d",
                     parent_record.ssid, parent_assoc.layer,
                     parent_assoc.layer_cap, parent_assoc.assoc,
                     parent_assoc.assoc_cap, parent_assoc.layer2_cap,
                     MAC2STR(parent_record.bssid), parent_record.primary,
                     parent_record.rssi);
            esp_err_t err = esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer);
            switchParentTimer = currentTimeMs(); // reset timer for event <MESH_EVENT_PARENT_DISCONNECTED>
            if (err != ESP_OK) {
                ESP_LOGE(TAG, "esp_mesh_set_parent Error %d  my_type: %d  my_layer: %d", err, my_type, my_layer);
            }
            selfOrganizeReactivateTimer = SELF_ORGANIZE_REACTIVATE_TIME; // start self organize reactivation timer
        }
    } else {
        ESP_LOGE(TAG, "No eligible closer Parent found");
        if (currentRSSI == NO_RSSI) { // scan again if no connection yet
            esp_mesh_set_self_organized(false, false);
            esp_wifi_scan_stop();
            scan_config.show_hidden = 1;
            scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
            esp_wifi_scan_start(&scan_config, 0);
        }
    }
}

@michaelsimp
Author

michaelsimp commented Nov 25, 2024

Hi

By the way, all of yesterday's and today's tests and logs were made with your recommendation of only using
disconnected->reason == WIFI_REASON_BEACON_TIMEOUT

I have been testing by getting the 4 nodes stacked up across 4 layers and powering off the NODE on layer 2 rather than the MESH_ROOT, as this provides a cleaner set of logs.
Test 1 Nov 26.zip

See test 1

Layer 1 MESH_ROOT 48:ca:43:9b:53:d8
NODE A Layer 2 48:27:e2:18:39:80
NODE B Layer 3 48:ca:43:9b:54:c0
NODE C Layer 4 48:ca:43:9b:5d:20

Then power down NODE A on layer 2

NODE B switched from layer 3 to layer 2 and parent from NODE A to MESH_ROOT - perfect
NODE C stayed on layer 4 with parent 48:ca:43:9b:54:c1 which is now on layer 2, and does not show in the

Is this valid?
Node B moved from layer 3 to 2 when its parent dropped. Why did Node C not move to layer 3 ?

It stayed like this for minutes while I wrote this up

Then I powered off the MESH_ROOT (see test 1 MESH_ROOT line 469). This node 48:ca:43:9b:53:d8 now becomes MESH_NODE and child of Node B.

See test 1 Node B.txt line 2038
NODE B which was on layer 2 connected to MESH_ROOT goes to MESH_IDLE with 2 children Node C and the old MESH_ROOT 48:ca:43:9b:53:d8

It remains broken like this indefinitely.

@zhangyanjiaoesp
Collaborator

  1. Regarding the issue of the log infinite loop (aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201), I have already explained it in my previous comment.

    "It seems that in test2_2 and test2_3, there was no opportunity to switch to the self-organized network, and the device kept trying to connect to the originally configured parent, but the parent could not be detected."

    Maybe we should first investigate why the specified parent node cannot be found at this point. Is it due to a power failure, has it become idle, or is there another underlying reason?

"Are you able to reproduce these problems with the ip_internal_network you modified a week or so back? I get lost in all of this and feel we would make better progress if you were able to test, analyze and debug directly."

Sorry, I can't reproduce your issue on my side.

  1. I have already discussed this with you in my previous comment: when selecting a better parent node, what criteria do you prioritize? I believe you can completely disregard the conditions in the example and instead design your own criteria based on your specific needs. First, you can move the definitions of parent_assoc and parent_record outside the function.
    [image]

Then update parent_assoc->layer when connecting to the parent.
[image]

Before scanning, retrieve the current parent information.
[image]

Finally, within the findClosestParent() function, design the criteria for selecting a better parent based on your requirements and the issues encountered during testing.

"My thoughts were, I am not wanting to build the mesh network from scratch as I start with a self configured network. I am only planning to make changes to nodes with poor RSSIs. So far my tests have been successful network architecture wise (when I have a fixed ROOT so I don't get the broken mesh problem)."

You mentioned that you don't want to rebuild the network from scratch, but instead, you want to adjust the initial network formed by the self-organizing process. However, during the actual testing, I've observed that you often call scan at the application layer while the initial network is still being formed, which forcibly interrupts the self-organizing process.
[image]

So, when you call scan at the application layer, is it completely random? Would it make sense to first check whether the initial network has been fully formed before manually triggering the scan?
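
One possible way to gate the manual scan, sketched under assumptions: the application records when the parent-connected event last fired and only scans after the link has been stable for a settle period. The mesh_link_state_t type, manual_scan_allowed() and the settle time are all hypothetical, and whether a scan should also be allowed while the node has no parent at all is a separate design decision this sketch does not make.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-side view of the mesh link. */
typedef struct {
    bool     parent_connected;   /* set on parent connected, cleared on disconnect */
    uint32_t connected_since_ms; /* timestamp of the last parent-connected event */
} mesh_link_state_t;

/* Returns true only when the node has a parent and the link has been up
 * for at least settle_ms. Unsigned subtraction keeps the comparison correct
 * across a wrap of the millisecond counter. */
static bool manual_scan_allowed(const mesh_link_state_t *s,
                                uint32_t now_ms, uint32_t settle_ms)
{
    if (!s->parent_connected) {
        return false;
    }
    return (uint32_t)(now_ms - s->connected_since_ms) >= settle_ms;
}
```

The application would call manual_scan_allowed() before starting its own scan and skip the scan while the initial network is still forming.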

I believe we must first resolve the issues mentioned in points 3 and 4 before proceeding with further problem analysis. If the initial logic framework isn't properly established, it could lead to a range of unforeseen issues down the line, which would be quite painful for me to handle.
