CASMCMS-8754: Make BOS V2 status operator resilient to power errors #222

Merged
jsollom-hpe merged 1 commit into support/2.0 from CASMCMS-8754-Common-Power-Operator
Oct 3, 2023

Contributor

@jsollom-hpe jsollom-hpe commented Oct 3, 2023

Summary and Scope

When CAPMC returns errors, handle them and disable the associated nodes instead of panicking and halting all forward progress.

Added new classes, including a shared base class for the power operators.

Handle CAPMC errors that do not identify which nodes caused them, and determine which nodes experienced the errors.

If CAPMC returns an error that cannot be associated with an individual node, then reissue the CAPMC command for each individual component. Under that condition, any component that returns an error is guaranteed to be the cause of the error, so it is associated with the error and disabled so that BOS does not attempt to retry it.
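
The per-component retry described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `PowerError`, `isolate_failures`, `fake_power_call`, and the component names are all hypothetical stand-ins, and the real CAPMC client is not used.

```python
from typing import Callable, Dict, List


class PowerError(Exception):
    """Hypothetical stand-in for an error returned by the power service."""


def isolate_failures(
    components: List[str],
    power_call: Callable[[List[str]], None],
) -> Dict[str, str]:
    """Return {component: error message} for the components that fail.

    The whole batch is tried first. If it fails with an error that names
    no culprit, the call is reissued once per component.
    """
    try:
        power_call(components)
        return {}
    except PowerError:
        pass  # Batch failed without naming a node; fall through to single calls.

    failed: Dict[str, str] = {}
    for component in components:
        try:
            power_call([component])
        except PowerError as err:
            # A single-component call that fails identifies the culprit.
            failed[component] = str(err)
    return failed


def fake_power_call(xnames: List[str]) -> None:
    """Hypothetical power call that fails whenever one bad node is included."""
    if "x3000c0s2b0n0" in xnames:
        raise PowerError("undifferentiated failure")


bad = isolate_failures(["x3000c0s1b0n0", "x3000c0s2b0n0"], fake_power_call)
```

In this sketch, `bad` holds only the failing component, which the caller could then mark as disabled so it is not retried.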

As a consequence of handling CAPMC-returned errors, the power operators need to be able to disable nodes. When they receive errors from CAPMC, they can declare those nodes disabled.

Create a power_operator_base class because error handling is common to all three power operators. This is an abstract base class that each of the power operators inherits from. It collects all of the error handling into one place, so that it is not spread across the three power operators.
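
As an illustration of that structure, here is a minimal sketch of an abstract base class that owns the error handling while each subclass supplies only its power call. The names `PowerOperatorBaseSketch`, `PowerOnSketch`, `_power`, and `act` are hypothetical and are not taken from the BOS source.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PowerOperatorBaseSketch(ABC):
    """Hypothetical base class: shared error handling lives here."""

    def __init__(self) -> None:
        # Components disabled because of a power error, with the reason.
        self.disabled: Dict[str, str] = {}

    @abstractmethod
    def _power(self, components: List[str]) -> Dict[str, str]:
        """Issue the power command; return {component: error} for failures."""

    def act(self, components: List[str]) -> List[str]:
        """Run the power command and disable any component that errored."""
        errors = self._power(components)
        for component, message in errors.items():
            self.disabled[component] = message
        # Only components without errors continue through the workflow.
        return [c for c in components if c not in errors]


class PowerOnSketch(PowerOperatorBaseSketch):
    """Hypothetical subclass; pretends every node ending in 'n1' fails."""

    def _power(self, components: List[str]) -> Dict[str, str]:
        return {c: "power-on failed" for c in components if c.endswith("n1")}


op = PowerOnSketch()
healthy = op.act(["x1n0", "x1n1"])
```

With this layout, a second or third operator would only need to implement `_power`, and the disabling logic would not be repeated.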

Issues and Related PRs

List and characterize relationship to Jira/Github issues and other pull requests. Be sure to list dependencies.

  • Resolves [CASMCMS-8754](issue link)

Testing

List the environments in which these changes were tested.

Tested on:

  • Gamora, a 1.4.X system
  • Virtual Shasta

Test description:

I'll e-mail out my test descriptions to any reviewers.

Risks and Mitigations

Medium

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@jsollom-hpe jsollom-hpe requested a review from a team as a code owner October 3, 2023 16:55
Contributor

@rbak-hpe rbak-hpe left a comment

A couple of minor things, but otherwise it looks good.

src/bos/operators/power_on.py (resolved)
src/bos/operators/base.py (outdated, resolved)
@jsollom-hpe jsollom-hpe force-pushed the CASMCMS-8754-Common-Power-Operator branch from 6da619e to ff6a4f0 Compare October 3, 2023 18:54
@jsollom-hpe jsollom-hpe merged commit b0fa7f0 into support/2.0 Oct 3, 2023
3 checks passed
3 participants