pr2_computer_monitor reports "Stale" for CPU Temp (ros ticket #4171) #152

Open
ahendrix opened this issue Mar 12, 2013 · 9 comments

@ahendrix
Member

It looks like the thread that calls ipmitool and checks the clock speeds isn't working. The subprocesses are probably hung, possibly because of a race condition involving ipmitool.

Probably the best solution is to have a timeout of some kind on the ipmitool call.
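A minimal sketch of what such a timeout could look like. It assumes a Python version where `subprocess.run` accepts `timeout`; the helper name `run_with_timeout` and the 10 second default are made up for illustration and are not taken from cpu_monitor.py.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (ok, text).

    ok is False if the command timed out, could not be started,
    or exited with a non-zero status. Hypothetical helper.
    """
    try:
        # On timeout, subprocess.run kills the child and reaps it
        # before raising TimeoutExpired, so no hung process is left behind.
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timed out after %.1f s" % timeout_s
    except OSError as exc:
        # For example, the binary is not installed.
        return False, "could not run command: %s" % exc
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout
```

The monitor could then call something like `run_with_timeout(["ipmitool", "sdr"])` and report an error status when `ok` is false, instead of blocking the temperature thread forever.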

trac data:

@ahendrix
Member Author

[watts] The other possibility is that an exception is being raised and silently swallowed. I should add a "rospy.logerr" call to the exception handler in the check_ipmitool() function. That would at least give us a log of what went wrong.
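A rough sketch of logging from the exception handler. To keep it runnable without ROS it uses the standard `logging` module as a stand-in for `rospy.logerr`; the wrapper name `check_logged` and the logger name are hypothetical, and the real check_ipmitool() body is not shown in this ticket.

```python
import logging

log = logging.getLogger("cpu_monitor_sketch")  # stand-in for the rospy logger


def check_logged(read_fn):
    """Call read_fn() and return its result, or None if it raised.

    Any exception is logged with its traceback instead of vanishing.
    """
    try:
        return read_fn()
    except Exception:
        # In the node this would be roughly:
        #   rospy.logerr("ipmitool check failed: %s" % traceback.format_exc())
        log.exception("temperature check failed")
        return None
```

Returning `None` lets the caller publish an error diagnostic for that cycle rather than leaving the last good value in place.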

@ahendrix
Member Author

[watts] One way to fix this could be to terminate the thread that is calling ipmitool and re-spawn it. The restart could be triggered when an update takes longer than some threshold, say 60 seconds.

I can try to check if the temp stat call is stale, and run another timer for it.

The temp stat timer thread in cpu_monitor.py checks both ipmitool and mpstat, so either call could be the one that is malfunctioning.
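The stale check could be sketched roughly like this. The class name `StalenessWatch` and the injectable `clock` argument are made up for illustration; the 60 second threshold is the figure floated above, not a measured value.

```python
import threading
import time


class StalenessWatch:
    """Tracks when a value was last refreshed (hypothetical helper)."""

    def __init__(self, max_age_s=60.0, clock=time.monotonic):
        self._max_age_s = max_age_s
        self._clock = clock
        self._lock = threading.Lock()
        self._last_update = clock()

    def mark_updated(self):
        # Called by the temperature thread after each successful check.
        with self._lock:
            self._last_update = self._clock()

    def is_stale(self):
        # Called by the publishing side to decide whether to report
        # "Stale" and request a restart of the checking thread.
        with self._lock:
            return self._clock() - self._last_update > self._max_age_s
```

Using a monotonic clock avoids false stale reports when the system time is adjusted, which matters on robots that sync clocks between computers.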

@ahendrix
Member Author

[watts] r42391 has some fixes that may help this.

@ahendrix
Member Author

[watts] Will try to restart the thread if temperature checking goes stale, r42393

@ahendrix
Member Author

[watts] I ran this program on PRF and PRG for over 100 hours. During this time, I did not observe this problem, even with stressing the computers. The fixes above should capture the condition if it occurs, and report it.

In the meantime, I am resolving this ticket until we see the problem again.

@ahendrix
Member Author

[watts] Looks like this is still a problem. The temperature thread can restart, but it looks like it restarts improperly. I'm running experiments to investigate.

r44694 - Correct name of temperature thread.

@ahendrix
Member Author

[watts] r44732 - Temp thread now restarts correctly. Was deadlocking on restart.

@ahendrix
Member Author

[watts] It looks like there is a race condition in this restart that can cause the node to deadlock. The existing threads are not being terminated.

To fix this, I should change the threading model of cpu_monitor.py to move away from "Timers".
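One possible shape for a non-Timer model, as a sketch only: a single long-lived worker thread that loops on a stop `Event`, so there is exactly one thread to terminate and nothing gets re-armed behind your back. The class name `PeriodicWorker` and its parameters are hypothetical and do not reflect what cpu_monitor.py actually ended up doing.

```python
import logging
import threading

log = logging.getLogger("cpu_monitor_sketch")


class PeriodicWorker:
    """Runs fn() every period_s seconds on one thread until stopped."""

    def __init__(self, fn, period_s):
        self._fn = fn
        self._period_s = period_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            try:
                self._fn()
            except Exception:
                log.exception("periodic check failed")
            # wait() doubles as an interruptible sleep: it returns early
            # as soon as stop() sets the event.
            self._stop.wait(self._period_s)

    def stop(self, join_timeout_s=5.0):
        """Ask the loop to exit; return True if the thread really ended."""
        self._stop.set()
        self._thread.join(join_timeout_s)
        return not self._thread.is_alive()
```

Note that `stop()` can only return `True` if `fn` itself returns, so this still depends on the ipmitool call having a timeout of its own.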

@ahendrix
Member Author

[watts] After talking with Nathan, it looks like the root cause of this is that the call to "ipmitool sdr" doesn't return sometimes. Unfortunately, we still have to fix this problem in pr2_computer_monitor.
