Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp Callback Groups #92

Merged
merged 6 commits into from
Sep 18, 2023
Merged

Conversation

amalnanavati
Copy link
Contributor

@amalnanavati amalnanavati commented Sep 15, 2023

Description

Recommended Reading: ROS2 tutorial on Executors, ROS2 tutorial on Callback Groups.

In service of #62 . Paired with ada_ros2#20.

Before PR #90 , all callbacks were in the same mutually exclusive callback group, which means only one could be running at a time. This sometimes resulted in watchdog messages being dropped on lovelace while a service call was active (Issue #63 ). PR #90 changed this to put all callbacks into a Reentrant Callback Group. However, this had the following issues:

  1. Too much CPU load. This is documented in the ROS2 tutorials: "the executor overhead in terms of CPU and memory usage is considerable.". In practice, on both t0b1 and lovelace, due to the high CPU load the controller (which is running on the same computer) sometimes resulted in very jerky movements.
  2. Missed messages. Strangely enough, having everything in one ReentrantCallbackGroup still resulted in the watchdog messages sometimes being missed. However, I think this time it is due to a different reason -- some callback in the callback group fires very frequently, and therefore uses up all the threads, so there is no longer a thread available for the watchdog. This is documented in the ROS2 tutorials: "Higher priority callbacks may be blocked by lower priority callbacks."
  3. Callback queue filling up. If a callback takes quite long (e.g., face detection prior to Move Face Detection out of callbacks #91 ), then the callback queue in RMW would fill up, which I'd imagine negatively impacts CPU load. "if the processing time of some callbacks is longer, messages and events will be queued on the lower layers of the stack"
  4. Callbacks being called in parallel with themselves. Callbacks in a reentrant group can be called in parallel with themselves, which is not always desirable behavior. For example, we may not want to run two instances of face detection on two separate images -- we only care about the newest one.

This PR addresses the above issues by revamping callback groups in the following way:

  1. Use ReentrantGroups sparingly. If a MultiThreadedExecutor has even one ReentrantCallbackGroup in it, it is possible that that callback group eats up all the threads, and none are left for any other callback group. Instead, opting for multiple MutuallyExclusiveCallbackGroups is better, because we can anticipate how many threads the Executor needs, and ensure that one is always available for each callback group.
    1. The only reason to use a ReentrnatCallbackGroup is if the callback should be called in parallel with itself. If it can be called in parallel with itself but doesn't get a benefit from it, instead opt for it to have its own MutuallyExclusiveCallbackGroup. If it should be called in parallel with other callbacks (a common case), put the callbacks in separate MutuallyExclusiveCallbackGroups.
  2. Time-critical callbacks that should not be blocked by anything (e.g., watchdog messages) should be in their own callback group (typically their own MutuallyExclusiveCallbackGroup).
  3. For all other callbacks, the main question is whether to put them into a separate MutuallyExclusiveCallbackGroup, or a MutuallyExclusiveCallbackGroup that is shared with other callbacks. For that decision, determine whether the callbacks should be called in parallel with each other or not.
    1. For instance, in forque_sensor_hardware, the re-taring service and the timer to read from the FT sensor should be called in parallel with each other, so that FT readings continue to be read even while the service is active. Therefore, those callbacks should be put into separate mutually exclusive callback groups.
    2. On the other hand, in many of the MoveTo trees (e.g., MoveToMouth), the subscriptions and service calls are only ever called sequentially relative to each other. Hence, they can and should be placed into the same MutuallyExclusiveCallbackGroup.

Testing procedure

  • On both t0b1 and lovelace, run all actions, at least twice each, and ensure that there is no wierdness in having callbacks get called.

Before opening a pull request

  • Format your code using black formatter python3 -m black .
  • Run your code through pylint and address all warnings/errors. The only warnings that are acceptable to not address is TODOs that should be addressed in a future PR. From the top-level ada_feeding directory, run: pylint --recursive=y --rcfile=.pylintrc ..

Before Merging

  • Squash & Merge

@amalnanavati
Copy link
Contributor Author

I noticed the following issues when testing on lovelace:

  1. The feedback message from the action server updated infrequently, even though the robot was moving. This is likely because the MoveTo behavior has its own joint state subscription, when instead it should use MoveIt2's joint state subscription. (See issue [ROS2] MoveTo Behavior should use MoveIt2's JointState Subscriber #88 ).
  2. Sometimes the behavior tree will make a service call, the service call will be accepted and processed by the server, but it will take very long for the behavior tree to receive the response. I noticed this at the end of MoveToMouth, when the behavior tree was toggling off face detection, and it took ~50 secs after the server sent the response for the behavior tree to receive it. I have the following hypotheses for why this is happening:
    1. Subscribers in the same MutuallyExclusiveCallbackGroup are blocking the service result from being received. This is a feasible hypothesis because topics are taken off of the wait set before service calls. In MoveToMouth, the two subscribers that could be causing this are the check_face behavior (subscribing the /face_detection and the TF listener in compute_mouth_pose. The former is less likely because the service call has already been processed, so that publisher is no longer publishing.
      1. This brings up the issue that as of now, all subscribers persist even when the tree is not being called, and therefore continue to take up cycles of the callback group. The only ways to address this are either: (a) subscribe in initialise and destroy the subscriber in terminate; or (b) subscribe in setup (already done), destroy the subscriber in shutdown (not yet done), and re-setup/shutdown the tree after every action call.
    2. Callbacks in MoveIt2's ReentrantCallbackGroup are eating up all the threads in our MultiThreadedExecutor. This is possible because: (a) we have multiple MoveIt2 objects (see [ROS2] Global MoveIt2 Object #60 ); and (b) each MoveIt2 object has subscribers like /joint_state that may eat up resources. To address this, we should address [ROS2] Global MoveIt2 Object #60 , and also determine whether MoveIt2 truly needs the ReentrantCallbackGroup or not.

Note that I no longer noticed dropped watchdog messages or jerky/slow controller, so we are moving in the right direction :)

Base automatically changed from amaln/face_detection_revamp to ros2-devel September 15, 2023 17:32
@amalnanavati amalnanavati force-pushed the amaln/callback_groups_revamp branch from 59ad4cf to 42da1a5 Compare September 15, 2023 17:38
@amalnanavati
Copy link
Contributor Author

Issue 2.ii should be solved with #94. I still need to investigate to recreate and resolve issue 1 and issue 2.i. Further testing will be done on branches that build upon this (particularly amaln/improve_bite_transfer), but waits to address issues will be documented and pushed to this PR.

@amalnanavati
Copy link
Contributor Author

The issues observed above, both 1 and 2, no longer exist with #94 . Tested MoveToMouth/MoveFromMouth several times on lovelace and all times ran smoothly.

@amalnanavati
Copy link
Contributor Author

Tested on t0b1 as well and it works.

I did test it with screen capture going on in parallel, and we still see jerky controller movements. That is documented in #95.

@amalnanavati amalnanavati merged commit 4cc3acc into ros2-devel Sep 18, 2023
@amalnanavati amalnanavati deleted the amaln/callback_groups_revamp branch September 18, 2023 23:47
@amalnanavati amalnanavati mentioned this pull request Sep 19, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant