Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The joint calibrates, but could not reach the startup position #536

Open
MSECode opened this issue Nov 26, 2024 · 15 comments
Open

The joint calibrates, but could not reach the startup position #536

MSECode opened this issue Nov 26, 2024 · 15 comments
Assignees
Labels

Comments

@MSECode
Copy link
Contributor

MSECode commented Nov 26, 2024

In this issue we are going to investigate the problem about the joint failing the calibration and going in timeout observed in other issues and already mentioned as the second main problem in the body of this issue #529 and collected among the major issues that we have observed in the joints that mount an AMO encoder, and that have been collected in this table: https://github.com/icub-tech-iit/study-encoders/issues/22#issuecomment-2464924451

In different occasions we have observed that, even if the joint sometimes fails to calibrate (we are not 100% sure why this is happening), it tends to still reach the hard stop limit when using calibration type 10.

DoD
Problem analyzed, understood and possible solution proposed.

@MSECode MSECode added the bugfix label Nov 26, 2024
@MSECode MSECode self-assigned this Nov 26, 2024
@MSECode
Copy link
Contributor Author

MSECode commented Nov 27, 2024

In order to study this specific behavior we have chosen to force the calibration timeout to happen. In order to do so in the setup I've calibrated the device with a very low value of PWM, i.e. ~680.
Doing so, I've noticed the following problems, which I think need a further discussion.
First of all, the most problematic thing is that the calibration is "somehow" managed by two calibration timeout variables. One is defined and managed by the higher level (i.e. icub-main), while the other is referred to the fw. The problem is that these 2 variables do not communicate between them. Therefore, I've seen, as it is clearly highlighted by the part of logs I'll report below, that a joint can be end up the calibration and be started up in RUN mode even if we got the message of "calibration timeout error". This is a problem since in this case the joint will be moved with the so called "safe PIDs" and thus, it won't perform as expected.
From my analysis that can be explained as follows:

  • if the joint takes too much for calibrating overcoming the "SW timeout" (currently configured to 30 seconds) we will get the message:
    [ERROR] threeEncodersSetupCalib : Timeout while calibrating joint 0
    and following this error the controller will be advised to use the safe PIDs as shown by the following log lines:
    [ERROR] threeEncodersSetupCalib : set 1 : Calibration went wrong! Disabling axes and keeping safe pid limit
    Moreover, if this happens the high level controller will try to force the IDL control mode to the joint as shown here:
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 2 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 3 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 4 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 5 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 6 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 7 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 8 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 9 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 10 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 11 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 12 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 13 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 14 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 15 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[WARNING] embObjMotionControl::checkRemoteControlModeStatus() has done 16 attempts and will retry again after a 0.010000 sec delay. (BOARD three-encoders-setup IP 10.0.1.1, joint 0) -> current mode = cal, requested = idl
[DEBUG] from BOARD 10.0.1.1 (three-encoders-setup) time=44s 393m 177u :  src LOCAL, adr 0,(code 0x04000001, par16 0x0001 par64 0x0000000000003c2f) -> DEBUG: tag01  Position value in par64 e ed encoder pos is par1
[DEBUG] from BOARD 10.0.1.1 (three-encoders-setup) time=44s 393m 292u :  src LOCAL, adr 0,(code 0x04000001, par16 0x0000 par64 0x0000000000003c2f) -> DEBUG: tag01  s_eo_motioncontrol_updatedPositionsFromEncoders
[DEBUG] from BOARD 10.0.1.1 (three-encoders-setup) time=44s 393m 506u :  src LOCAL, adr 0,(code 0x04000001, par16 0x0000 par64 0x0000000000000000) -> DEBUG: tag01  mul:-1, dist:-902097, div:64.0000, zero:0
[DEBUG] from BOARD 10.0.1.1 (three-encoders-setup) time=44s 393m 611u :  src LOCAL, adr 0,(code 0x04000001, par16 0x0000 par64 0x000000000000370f) -> DEBUG: tag01  JointSet_do_odometry
[ERROR] A 0.250000 sec timeout occured in embObjMotionControl::checkRemoteControlModeStatus(), BOARD three-encoders-setup IP 10.0.1.1, joint 0, current mode: cal, requested: idl
[ERROR] failure of embObjMotionControl::checkRemoteControlModeStatus(j=0, targetmode=idl) for BOARD three-encoders-setup IP 10.0.1.1 after 16 attempts and 0.252670 seconds
[ERROR] In embObjMotionControl::setControlModeRaw(j= 0 , mode= idl ) for  BOARD three-encoders-setup (IP 10.0.1.1)   has failed checkRemoteControlModeStatus()

Thus this is what happens from a sw level point of view.

  • Here, I'd like to point out a first problem. Even if that happens but the lower level calibration timeout does not expire, the joint continues to move, as one can see from the screen of the logs I'll attach just below:

Image

As you can see, the position is still changing also after the error message of the calibration timeout has been sent to the board and the controller asked, failing ([ERROR] In embObjMotionControl::setControlModeRaw(j= 0 , mode= idl ) for BOARD three-encoders-setup (IP 10.0.1.1) has failed checkRemoteControlModeStatus()) for the IDL control mode.

This behavior is also clear if we take a look to the graph of the data dumped from the joint. In fact, it is possible to see that the robot gets calibrated and set is RUN mode. Moreover, when shutting down the yri the robots goes to the parking position

Image

Image

  • Instead, if both timers get expired while calibrating the joint remains in NOT CONIFIGURED (as one expects) and the calibration fails as we used to think. This is visible from the following graphs, where one can see that the input trajectory remains to zero:

Image

as it is clear form the graph, in this case the calibration did not finished because the timer expired before reaching the hard stop limit.
If, instead, the limit is reached before the fw timeout expires the joint is wrongly moved to RUN mode.

cc: @valegagge

@MSECode
Copy link
Contributor Author

MSECode commented Dec 2, 2024

Since PR opened I'll move to review

@valegagge
Copy link
Member

Hi @MSECode ,
Please report here if and how you fixed the problem. Thanks!

@MSECode
Copy link
Contributor Author

MSECode commented Dec 3, 2024

Considering what said and seen, we have understood that the problem is mainly related to that fact that --> when the icub-main CALIBRATION_TIMEOUT gets expired the request for the CONTROL_MODE IDLE is requested. Here the problem is that on a fw level, precisely here:

if (control_mode_cmd != eomc_controlmode_cmd_force_idle)
{
#ifdef WRIST_MK2
if(eomc_jsetconstraint_ergocubwrist != o->special_constraint)
{
if ((eOmc_controlmode_t)control_mode_cmd == o->control_mode) return TRUE;
}
#endif
if (o->control_mode == eomc_controlmode_calib) return FALSE;
, we are going to force the calibration to stop just when we receive a control mode of type FORCE_IDLE. Therefore, in the corner case described above, where we have the icub-main calibration timeout expired but not the controller calibration timeout, we ask the robot to go in IDLE but that is not enough for the robot for stopping the calibration procedure.
Therefore to (temporarily) solve this we need to decide if it is preferred to make the parametricCalibratorEth asking for the FORCE_IDLE control mode instead of just IDLE at these lines: https://github.com/robotology/icub-main/blob/master/src/libraries/icubmod/parametricCalibratorEth/parametricCalibratorEth.cpp#L762-L768, or we need to change the logic in the fw side where we need to stop the calibration even when control mode IDLE is received.
What do u think: @valegagge @ale-git @pattacini

On top of this I would add that I think that the core problem of this is related to the fact that during a calibration procedure that is managed both from icub-main part and icub-firmware part we are using 2 different timeouts that:

  • First of all they are not communicating, thus they start and obviously end at different times
  • Secondly they are managing the same thing, so it does not make much sense to use two different variables that are managing (actually more or less) the same thing. I'm saying more or less because the calibration pipeline as intended by icub-main is different from the one intended by the fw.

Anyway, where do you think to add the fix? I would propose to make icub-main asking for the FORCE_IDLE mode instead of IDLE but I do not know if that is a modification that can break other part of the fw because of related to an old architectural choice. I'm asking that since that part of the code has been remained untouched since years.

@pattacini
Copy link
Member

Hi @MSECode

Good that you narrowed down the root cause of the problem 👍🏻

A quick but well-tested fix (to rule out side effects), like the one you mentioned, is fine.

However, I suggest getting an overall view of the underlying state machine, essentially to master the reason why we implemented the two timeouts in the first place.

@ale-git
Copy link
Contributor

ale-git commented Dec 3, 2024

Hi @pattacini, the yarprobotinterface calibration timeout is referred not to the calibration itself (the name is slightly misleading), but to the reaching of the startup position once the calibration is done, performed by a positionMove command: if the joint can't reach the startup position, it is supposed that something went wrong and the joint is set in idle and the calibration output lower pwm limit is maintained. The board calibration timeout is referred instead to the encoder calibration itself. For example, in the type 10 calibration, the joint returns in unconfigured state if the hard stop has not been detected for some reason before the timeout.
In the calibration 10 example, the board calibration timeout occurs if the hard stop hasn't been reached. After succesful encoder calibration, a positionMove towards sterup position is performed, and if this position can't be reached in the given amount of time with desired precision, a yarprobotinterface "calibration" warning is launched (joint in idle, low pwm limit).

@pattacini
Copy link
Member

What @ale-git wrote above should clarify why there are two different timeouts 👍🏻

If I got it right, the high-level timeout should start being active only upon issuing the positionMove, which in turn should be called only when the calibration is done. Are the two operations serialized?

@MSECode I'm no longer sure that your quick fix can have the green light.
Better off getting aligned with @ale-git on this.

@MSECode
Copy link
Contributor Author

MSECode commented Dec 3, 2024

Yes,what @ale-git said is totally correct. The issue I've seen here, which I think is just happening with the calibration 10, it's due to the fact that, if the joint reaches the hard stop position after the high level timeout expires but while the high level timeout is still going, the joint then is moved to the startup position and set in RUN but it will be configured with the so called safe pids, but the user has not a clear vision of this since the joint will appear as correctly calibrated (this is true if it does not look to the yri logs)

@valegagge
Copy link
Member

valegagge commented Dec 3, 2024

Hi guys,
what I saw from the code in the calibrator in icub-main in that there are two timeouts: one used to check the calibration and one used to check the zero position reaching.

Image

Here the link to the code
They are used here:

@pattacini
Copy link
Member

pattacini commented Dec 4, 2024

if the joint reaches the hard stop position after the high level timeout expires but while the high level timeout is still going

Not clear to me...

Frankly, as a general comment, I feel the need for a diagram depicting all the components at stake as well as their interactions. This diagram should also serve as documentation to get everyone aligned. Is there already something available in this respect?

Then, timeouts are weak patterns that we tend to use to keep things simple. In this situation, if there are 2 timeouts, 1 at low level (calibration) and 1 at high level (homing), each checking two different operations taking place serially, I would say that there's some robustness.

Instead, if there are 3 timeouts as @valegagge suggests (did I get it right?), 1 at low level (icub-firmware:calibration) and 2 at high level (icub-main:calibration and icub-main:homing), whose 1 checks what is being checked already by another piece of code elsewhere, well, things start getting quite weak...

@valegagge
Copy link
Member

Hi @pattacini , this is a mess!
as soon as possible I'll provide a state machine diagram to explain the behavior better. 💪

@valegagge
Copy link
Member

I created the fisrt draft of the state machine of the ParametricCalibratorETh strictly related to the current implementation.

This is the first step toward the sw architecture definition of the calibrator.
Image

I hope this diagram can help @MSECode to understand better what is going on.

I'm available to discuss any questions and improve the diagram itself.

@MSECode
Copy link
Contributor Author

MSECode commented Dec 17, 2024

After discussion and analysis on the bench setup w/ @valegagge we have observed the following:

Behavior 1 --> calibrator on icub-main requesting IDLE mode when timeout expires --> current behavior --> https://github.com/robotology/icub-main/blob/master/src/libraries/icubmod/parametricCalibratorEth/parametricCalibratorEth.cpp#L697-L702

When the calibration timeout set on icub-main (hardcoded to 20 sec) expires the IDLE control mode is requested to the low level calibrator. This is done correctly but the calibrator on the low level continues to calibrate since this request is not taken into account by the calibrator when the joint is in calibration mode:
See here:
Image

If current control mode, i.e. o->control_mode, where "o" is the JointSet object, is equal to eomc_controlmode_calib we return FALSE

This implies that the joint continues to move, as said here: #536 (comment) and the output can be of 2 types:

  1. joint reaches the hard-stop limit before the low level timeout expires. If this happens the joint is set first in IDLE and just after that in POSITION MODE. This is shown here in the JointSet_do_wait_calibration:

Image
This is the most common but maybe worst behavior, since the joint can be moved by a user if the motorgui is opened. Moreover, because of the calibration did not complete on the high level the joint is moved with the safe pids. Furthermore, due to the same consequence, the joint did not moved to the startup position but it remains steady at the hard-stop position until it is requested to be parked when the yri gets closed.

  1. joint does not reach the hard-stop position before the low level timeout expires. In this case the joint will be moved to the NOT_CONIFGURED status. This is a safe status, even though we are required to restart all the calibration procedure, since in this case the joint cannot be moved anywhere. However, this is though to happen since waiting for the timeout to end (30 sec) the joint should calibrate really slow or not be able to move.

Behavior 2 --> calibrator on icub-main requesting FORCE_IDLE mode when timeout expires --> tested behavior

When the calibration timeout set on icub-main (hardcoded to 20 sec) expires the FORCE_IDLE control mode is requested to the low level calibrator. This is done correctly and the calibrator on the low level stops to calibrate since this request is taken into account by the calibrator when the joint is in calibration mode:
See here:
Image

However, even if this happens we can see that inside the method that should actually set the control mode, i.e. Joint_set_control_mode, this happens:
Image

If the joint is calibrating we just RETURN, thus nothing happens.
However, since the motor is set in IDLE we could see that the joint stopped where it was.
One problem we observed here is that, due to what described, the joint thinks to be calibrated since it does not move anymore but from the point of view of the calibrator we are still calibrating (we just have set to IDLE the motor, not the joint). Therefore, the final behavior is the same of the previous case.
The joint gets set in POSITION MODE. It stays still where it is but it can be moved using the motorgui or whatever method. At closure it goes to the parking position.
Moreover, another problem here to mention is that in this case the joint did not reached the hard-stop position (it stops when motor is set to IDLE). Thus, the zero position is different from the one we have requested through the calibration parameters.

Here some logs I've taken using the debugger and keeping the trace open. I've focused on printing the different control modes that gets requested and set:

behavior 1

behavior-1.log

behavior 2

behavior-2.log

These are the control mode value (On the logs you have the numbers not string)
Image

Together w/ @valegagge we have though to address the whole issue starting from rethinking to both calibrator since this seems to be more a problem of the architecture that has been probably changes so much in different times that now is such compromised of multiple corner cases that simple adjustment are not enough anymore but they risk to make both code management and the current calibration pipeline much worse.
In fact, the behaviors described have a low probability to happen (as things are now on the robot) and thus this problem can be addressed after the vacations by doing a proper re-construction of the calibrators.

cc: @pattacini, @ale-git

@pattacini
Copy link
Member

Unfortunately, I have really no time these days for an in-depth reading.
I concur that the operation is somewhat complicated and kind of borderline but most of all I agree with you to go for a deep overhaul of the component.
The time has come to see a Stateflow machine in action, finally 😉

@valegagge
Copy link
Member

valegagge commented Dec 20, 2024

I created a new repo to address the whole activity on the calibrator both high-level and embedded.

I don't want to close this issue for now, because we need to understand how to address its fix (This bug is very annoying during the robot usage).

So, for now, I report it in a list of bugs we need to address. https://github.com/icub-tech-iit/study-calibrator/issues/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants