pr2_computer_monitor reports "Stale" for CPU Temp (ros ticket #4171) #152
Comments
[watts] The other possibility is that there is an uncaught exception. I should add a "rospy.logerr" call to the exception handler in the check_ipmitool() function, so that we at least get a log of the data.
[watts] One way to fix this could be to terminate the thread that is calling ipmitool and re-spawn it. The restart could be triggered if the update takes longer than about 60 seconds. I can try to check whether the temp stat call is stale, and run another timer for it. The temp stat timer thread in cpu_monitor.py checks both ipmitool and mpstat, so either call could be the one that is malfunctioning.
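A minimal sketch of the staleness check described above, not the actual cpu_monitor.py code. The class name `StaleWatch`, its methods, and the injectable `clock` parameter are hypothetical; only the 60-second threshold comes from the comment. The monitoring thread would call `mark_updated()` after each successful ipmitool/mpstat pass, and a supervisor would poll `is_stale()` to decide whether to restart it.

```python
import threading
import time


class StaleWatch:
    """Tracks when a periodic check last succeeded (hypothetical helper)."""

    def __init__(self, max_age_s=60.0, clock=time.monotonic):
        self._max_age_s = max_age_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._lock = threading.Lock()
        self._last_update = clock()

    def mark_updated(self):
        """Call from the worker thread after each successful update."""
        with self._lock:
            self._last_update = self._clock()

    def is_stale(self):
        """True if no update has been recorded within max_age_s seconds."""
        with self._lock:
            return self._clock() - self._last_update > self._max_age_s
```

Using a monotonic clock rather than wall time keeps the check from misfiring if the system clock is adjusted while the node is running.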
[watts] r42391 has some fixes that may help this.
[watts] Will try to restart the thread if temperature checking goes stale, r42393.
[watts] I ran this program on PRF and PRG for over 100 hours. During this time, I did not observe this problem, even when stressing the computers. The fixes above should capture the condition if it occurs, and report it. In the meantime, I am resolving this ticket until we see the problem again.
[watts] Looks like this is still a problem. The temperature thread can restart, but it looks like it restarts improperly. I'm running experiments to investigate. r44694 - Correct name of temperature thread.
[watts] r44732 - Temp thread now restarts correctly. It was deadlocking on restart.
[watts] It looks like there is a race condition in this restart that can cause the node to deadlock, because the existing threads are not being terminated. To fix this, I should change the threading model of cpu_monitor.py to move away from "Timers".
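One possible shape for a threading model without Timers, offered only as a sketch: a single long-lived thread that loops until a `threading.Event` is set. The name `PeriodicWorker` and its parameters are hypothetical, and the standard `logging` module stands in for `rospy.logerr` so the example runs without ROS. Because only one thread ever exists per check, there is no chain of re-armed timers left behind on shutdown or restart.

```python
import logging
import threading


class PeriodicWorker:
    """Runs func every period_s seconds on one thread until stopped (hypothetical helper)."""

    def __init__(self, func, period_s):
        self._func = func
        self._period_s = period_s
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop_event.is_set():
            try:
                self._func()
            except Exception:
                # Log instead of letting the loop die silently.
                logging.exception("periodic check failed")
            # wait() returns early as soon as stop() sets the event.
            self._stop_event.wait(self._period_s)

    def stop(self, join_timeout_s=None):
        """Ask the loop to exit; return True if the thread has finished."""
        self._stop_event.set()
        self._thread.join(join_timeout_s)
        return not self._thread.is_alive()
```

Note that `stop()` can only return True once `func` itself returns, so a call that hangs inside `func` still needs its own timeout.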
[watts] After talking with Nathan, it looks like the root cause of this is that the call to "ipmitool sdr" doesn't return sometimes. Unfortunately, we still have to fix this problem in pr2_computer_monitor.
It looks like the thread that calls ipmitool and checks the clock speeds isn't working. The subprocesses are probably hung somehow, and this may be some kind of weird race condition with ipmitool.
Probably the best solution is to have a timeout of some kind on the ipmitool call.
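A rough sketch of what such a timeout could look like, assuming Python 3 and the standard `subprocess` module; this is not the fix that was committed to pr2_computer_monitor. The function name `run_with_timeout` and the 10-second default are assumptions chosen for illustration. On timeout, `subprocess.run` kills the child and waits for it, so a hung call does not leave the calling thread blocked.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd (a list of arguments) and return (ok, text).

    ok is False if the command timed out, could not be started,
    or exited with a nonzero status; text then describes the failure.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # The child has already been killed and reaped by subprocess.run.
        return False, "timed out after %.1f s" % timeout_s
    except OSError as err:
        return False, "could not run command: %s" % err
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout
```

For the case in this ticket, the call would be something like `run_with_timeout(["ipmitool", "sdr"])`, with a False result reported as an error rather than leaving the CPU temperature status to go stale.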