CASMCMS-8754: Make BOS V2 status operator resilient to power errors #222
Summary and Scope
When CAPMC returns errors, handle them and disable the associated nodes instead of panicking and halting all forward progress.
Added some new classes.
Handle troublesome CAPMC errors that do not identify which nodes experienced them.
If CAPMC returns an error that cannot be associated with an individual node, then reissue the CAPMC command once for each individual component. Under that condition, any component that returns an error is guaranteed to be a cause of the error, so it is associated with the error and disabled to keep BOS from retrying it.
As a consequence of handling CAPMC-returned errors, the power operators need to be able to disable nodes. When they receive errors from CAPMC, they can declare those nodes disabled.
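The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PowerResult`, `find_failed_nodes`, and the `send_command` callable are hypothetical names, and the result shape is an assumption rather than the real CAPMC response schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PowerResult:
    """Hypothetical result of one power command (not the real CAPMC schema)."""
    error: str = ""  # overall error message; empty on success
    node_errors: Dict[str, str] = field(default_factory=dict)  # per-node errors


def find_failed_nodes(
    nodes: List[str],
    send_command: Callable[[List[str]], PowerResult],
) -> Dict[str, str]:
    """Return a mapping of node ID -> error message for nodes to disable."""
    result = send_command(nodes)
    if not result.error:
        return {}  # the whole batch succeeded
    if result.node_errors:
        # The error already names the offending nodes.
        return dict(result.node_errors)
    # The error names no node: reissue the command one node at a time.
    # Any node that fails on its own is a cause of the error.
    failed: Dict[str, str] = {}
    for node in nodes:
        single = send_command([node])
        if single.error:
            failed[node] = single.node_errors.get(node, single.error)
    return failed
```

The caller would then mark every node in the returned mapping as disabled and record the error message against it, so that later passes do not retry it.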
Create a power_operator_base class because error handling is common to all three power operators. This is an abstract base class that each of the power operators inherits from. It collects all of the error handling into one place, so that it is not spread across the three power operators.
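As an illustration of that design only, the shape of such a base class might look like the sketch below. The class name `PowerOperatorBase`, the methods `_send_power_command` and `act`, and the component fields `id`, `enabled`, and `error` are all assumptions made for this example; they are not taken from the BOS source.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PowerOperatorBase(ABC):
    """Hypothetical base class holding error handling shared by power operators."""

    @abstractmethod
    def _send_power_command(self, nodes: List[str]) -> Dict[str, str]:
        """Issue the operator-specific command.

        Returns a mapping of failed node ID -> error message (empty on success).
        """

    def act(self, components: List[dict]) -> List[dict]:
        """Run the power command, then disable every component that failed."""
        by_id = {component["id"]: component for component in components}
        errors = self._send_power_command(list(by_id))
        for node_id, message in errors.items():
            by_id[node_id]["enabled"] = False  # stop BOS from retrying this node
            by_id[node_id]["error"] = message
        return components


class PowerOnOperator(PowerOperatorBase):
    """Hypothetical subclass; only the command itself differs per operator."""

    def _send_power_command(self, nodes: List[str]) -> Dict[str, str]:
        return {}  # placeholder: a real operator would call the power service here
```

With this layout each power operator overrides only the command it sends, while the disable-on-error logic lives in one place.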
Summarize what has changed. Explain why this PR is necessary. What is impacted? Is this a new feature, critical bug fix, etc?
Is this change backwards incompatible, backwards compatible, or a backwards compatible bugfix?
Issues and Related PRs
List and characterize relationship to Jira/Github issues and other pull requests. Be sure to list dependencies.
<insert branch name here>
<insert PR URL here>
Testing
List the environments in which these changes were tested.
Tested on:
Gamora, a 1.4.X system
Test description:
I'll e-mail out my test descriptions to any reviewers.
How were the changes tested and success verified? If schema changes were part of this change, how were those handled in your upgrade/downgrade testing?
Risks and Mitigations
Medium
Pull Request Checklist