Revamp Callback Groups #92

amalnanavati · 2023-09-15T17:06:09Z

Description

Recommended Reading: ROS2 tutorial on Executors, ROS2 tutorial on Callback Groups.

In service of #62 . Paired with ada_ros2#20.

Before PR #90 , all callbacks were in the same mutually exclusive callback group, which means only one could be running at a time. This sometimes resulted in watchdog messages being dropped on lovelace while a service call was active (Issue #63 ). PR #90 changed this to put all callbacks into a Reentrant Callback Group. However, this had the following issues:

Too much CPU load. This is documented in the ROS2 tutorials: "the executor overhead in terms of CPU and memory usage is considerable.". In practice, on both t0b1 and lovelace, due to the high CPU load the controller (which is running on the same computer) sometimes resulted in very jerky movements.
Missed messages. Strangely enough, having everything in one ReentrantCallbackGroup still resulted in the watchdog messages sometimes being missed. However, I think this time it is due to a different reason -- some callback in the callback group fires very frequently, and therefore uses up all the threads, so there is no longer a thread available for the watchdog. This is documented in the ROS2 tutorials: "Higher priority callbacks may be blocked by lower priority callbacks."
Callback queue filling up. If a callback takes quite long (e.g., face detection prior to Move Face Detection out of callbacks #91 ), then the callback queue in RMW would fill up, which I'd imagine negatively impacts CPU load. "if the processing time of some callbacks is longer, messages and events will be queued on the lower layers of the stack"
Callbacks being called in parallel with themselves. Callbacks in a reentrant group can be called in parallel with themselves, which is not always desirable behavior. For example, we may not want to run two instances of face detection on two separate images -- we only care about the newest one.

This PR addresses the above issues by revamping callback groups in the following way:

Use ReentrantGroups sparingly. If a MultiThreadedExecutor has even one ReentrantCallbackGroup in it, it is possible that that callback group eats up all the threads, and none are left for any other callback group. Instead, opting for multiple MutuallyExclusiveCallbackGroups is better, because we can anticipate how many threads the Executor needs, and ensure that one is always available for each callback group.
1. The only reason to use a ReentrnatCallbackGroup is if the callback should be called in parallel with itself. If it can be called in parallel with itself but doesn't get a benefit from it, instead opt for it to have its own MutuallyExclusiveCallbackGroup. If it should be called in parallel with other callbacks (a common case), put the callbacks in separate MutuallyExclusiveCallbackGroups.
Time-critical callbacks that should not be blocked by anything (e.g., watchdog messages) should be in their own callback group (typically their own MutuallyExclusiveCallbackGroup).
For all other callbacks, the main question is whether to put them into a separate MutuallyExclusiveCallbackGroup, or a MutuallyExclusiveCallbackGroup that is shared with other callbacks. For that decision, determine whether the callbacks should be called in parallel with each other or not.
1. For instance, in forque_sensor_hardware, the re-taring service and the timer to read from the FT sensor should be called in parallel with each other, so that FT readings continue to be read even while the service is active. Therefore, those callbacks should be put into separate mutually exclusive callback groups.
2. On the other hand, in many of the MoveTo trees (e.g., MoveToMouth), the subscriptions and service calls are only ever called sequentially relative to each other. Hence, they can and should be placed into the same MutuallyExclusiveCallbackGroup.

Testing procedure

On both t0b1 and lovelace, run all actions, at least twice each, and ensure that there is no wierdness in having callbacks get called.

Before opening a pull request

Format your code using black formatter python3 -m black .
Run your code through pylint and address all warnings/errors. The only warnings that are acceptable to not address is TODOs that should be addressed in a future PR. From the top-level ada_feeding directory, run: pylint --recursive=y --rcfile=.pylintrc ..

Before Merging

Squash & Merge

amalnanavati · 2023-09-15T17:24:04Z

I noticed the following issues when testing on lovelace:

The feedback message from the action server updated infrequently, even though the robot was moving. This is likely because the MoveTo behavior has its own joint state subscription, when instead it should use MoveIt2's joint state subscription. (See issue [ROS2] MoveTo Behavior should use MoveIt2's JointState Subscriber #88 ).
Sometimes the behavior tree will make a service call, the service call will be accepted and processed by the server, but it will take very long for the behavior tree to receive the response. I noticed this at the end of MoveToMouth, when the behavior tree was toggling off face detection, and it took ~50 secs after the server sent the response for the behavior tree to receive it. I have the following hypotheses for why this is happening:
1. Subscribers in the same MutuallyExclusiveCallbackGroup are blocking the service result from being received. This is a feasible hypothesis because topics are taken off of the wait set before service calls. In MoveToMouth, the two subscribers that could be causing this are the check_face behavior (subscribing the /face_detection and the TF listener in compute_mouth_pose. The former is less likely because the service call has already been processed, so that publisher is no longer publishing.
  1. This brings up the issue that as of now, all subscribers persist even when the tree is not being called, and therefore continue to take up cycles of the callback group. The only ways to address this are either: (a) subscribe in initialise and destroy the subscriber in terminate; or (b) subscribe in setup (already done), destroy the subscriber in shutdown (not yet done), and re-setup/shutdown the tree after every action call.
2. Callbacks in MoveIt2's ReentrantCallbackGroup are eating up all the threads in our MultiThreadedExecutor. This is possible because: (a) we have multiple MoveIt2 objects (see [ROS2] Global MoveIt2 Object #60 ); and (b) each MoveIt2 object has subscribers like /joint_state that may eat up resources. To address this, we should address [ROS2] Global MoveIt2 Object #60 , and also determine whether MoveIt2 truly needs the ReentrantCallbackGroup or not.

Note that I no longer noticed dropped watchdog messages or jerky/slow controller, so we are moving in the right direction :)

…lusive

amalnanavati · 2023-09-16T02:56:58Z

Issue 2.ii should be solved with #94. I still need to investigate to recreate and resolve issue 1 and issue 2.i. Further testing will be done on branches that build upon this (particularly amaln/improve_bite_transfer), but waits to address issues will be documented and pushed to this PR.

amalnanavati · 2023-09-18T23:31:29Z

The issues observed above, both 1 and 2, no longer exist with #94 . Tested MoveToMouth/MoveFromMouth several times on lovelace and all times ran smoothly.

amalnanavati · 2023-09-18T23:47:38Z

Tested on t0b1 as well and it works.

I did test it with screen capture going on in parallel, and we still see jerky controller movements. That is documented in #95.

This was referenced Sep 15, 2023

[ROS2] Revamp Callback Groups (and/or Executors) #62

Closed

[ROS2] Watchdog trips on LoveLace during service calls #63

Closed

[ROS2] Revamp QoS Settings #93

Open

Base automatically changed from amaln/face_detection_revamp to ros2-devel September 15, 2023 17:32

amalnanavati added 6 commits September 15, 2023 10:37

Put MoveIt2 objects into their own ReentrantCallbackGroup

2bedbbf

Revert create_action_servers.py default callback group to MutuallyExc…

b99682c

…lusive

Updated watchdog callback groups

e3c0748

Updated face detection callback groups

4084552

Updated food detection callback groups

d170560

Updated republisher callback groups

42da1a5

amalnanavati force-pushed the amaln/callback_groups_revamp branch from 59ad4cf to 42da1a5 Compare September 15, 2023 17:38

amalnanavati mentioned this pull request Sep 16, 2023

Put Watchdog callbacks in separate MutuallyExclusiveCallbackGroups personalrobotics/ada_ros2#20

Merged

amalnanavati merged commit 4cc3acc into ros2-devel Sep 18, 2023

amalnanavati deleted the amaln/callback_groups_revamp branch September 18, 2023 23:47

amalnanavati mentioned this pull request Sep 19, 2023

Improve Bite Transfer #72

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revamp Callback Groups #92

Revamp Callback Groups #92

amalnanavati commented Sep 15, 2023 •

edited

Loading

amalnanavati commented Sep 15, 2023

amalnanavati commented Sep 16, 2023

amalnanavati commented Sep 18, 2023

amalnanavati commented Sep 18, 2023

Revamp Callback Groups #92

Revamp Callback Groups #92

Conversation

amalnanavati commented Sep 15, 2023 • edited Loading

Description

Testing procedure

Before opening a pull request

Before Merging

amalnanavati commented Sep 15, 2023

amalnanavati commented Sep 16, 2023

amalnanavati commented Sep 18, 2023

amalnanavati commented Sep 18, 2023

amalnanavati commented Sep 15, 2023 •

edited

Loading