This repository has been archived by the owner on Jul 22, 2024. It is now read-only.
Summary:
When a node crash occurs, the error that is returned by CSM to LSF results in a message in bhist -l that looks like this:
[DATE]: External Message "csm_allocation_delete returned CSMERR_TIMEOUT" was posted from "root" to message box 0;
Describe the solution you'd like
We need something better, like: External Message "Compute node sierra4105 failed"
Issue Source:
We have been getting a lot of complaints from users about this message, and it requires a lot of manual work by our staff to provide them a meaningful answer. If the big data solution were really working, it would at least remove the manual work related to this problem... but really, the user-visible message should be better, calling out a node failure and, ideally, which node. Our other scheduler has done this for 10 years.
I believe that CSM should be generating an error message with the relevant data for bad nodes. The CSMERR_TIMEOUT should just be a general error path for when nodes fail to respond. The message definitely should have more illustrative data.
@fpizzano We should talk to someone on the LSF team to bubble this out.
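To make the request concrete, here is a rough sketch of the message selection being asked for: name the failed node(s) when they are known, and fall back to the generic "returned <error>" text only when they are not. This is not CSM or LSF code. The function name `external_message` and its parameters are hypothetical, and the sketch assumes the caller can obtain a list of unresponsive node names, which the thread does not confirm.

```python
from typing import Sequence


def external_message(api_name: str, error_code: str,
                     failed_nodes: Sequence[str] = ()) -> str:
    """Build the text posted to the job as an external message.

    Hypothetical helper for illustration only; it is not part of CSM or LSF.
    """
    if not failed_nodes:
        # No node information available: keep the generic error path.
        return f"{api_name} returned {error_code}"
    if len(failed_nodes) == 1:
        # Wording requested in the issue for a single failed node.
        return f"Compute node {failed_nodes[0]} failed"
    # Several nodes failed: list all of them in one message.
    return "Compute nodes " + ", ".join(failed_nodes) + " failed"


if __name__ == "__main__":
    print(external_message("csm_allocation_delete", "CSMERR_TIMEOUT"))
    print(external_message("csm_allocation_delete", "CSMERR_TIMEOUT",
                           ["sierra4105"]))
```

With a known failed node, the user would see the node name in bhist -l instead of only the timeout code.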