CASMTRIAGE-7594 - better resiliency. #59

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

dlaine-hpe
Contributor

Summary and Scope

There was a problem where a worker node's console was not being monitored by any pod. The node kept being assigned to, and then unassigned from, the console-node pod running on that same worker, and the other pod did not have the capacity to pick it up. As a result, the node was continually picked up and dropped, and each change forced the conmand process to be killed again, resulting in incomplete logging for all the nodes in that pod.

The fix reworks node acquisition and rebalancing so that they are much more stable and aware of the current status of the services. It requires changes in all three console repos.
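
For illustration only, the failure mode described above can be sketched as a small placement function. This is not the code from this PR: the `Pod` type, the `assign` function, the capacity numbers, and the rule that a pod should not monitor the worker it runs on are all assumptions inferred from the description. The sketch places each node on the least-loaded healthy pod that has spare capacity, and reports a node as orphaned instead of repeatedly handing it to an unsuitable pod.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical view of one console-node pod (not a type from the console repos).
type Pod struct {
	Name     string
	HostNode string // worker node this pod runs on
	Capacity int    // maximum number of consoles the pod can monitor (assumed)
	Healthy  bool
}

// assign places each node on the least-loaded healthy pod that has spare
// capacity and is not running on that node. Nodes that cannot be placed are
// returned as orphans rather than being forced onto an unsuitable pod.
func assign(nodes []string, pods []Pod) (map[string][]string, []string) {
	hosts := make(map[string]bool)
	for _, p := range pods {
		if p.Healthy {
			hosts[p.HostNode] = true
		}
	}
	// Place the most constrained nodes (those hosting a pod) first, then by name.
	ordered := append([]string(nil), nodes...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if hosts[ordered[i]] != hosts[ordered[j]] {
			return hosts[ordered[i]]
		}
		return ordered[i] < ordered[j]
	})

	placed := make(map[string][]string)
	var orphans []string
	for _, n := range ordered {
		best := -1
		for i, p := range pods {
			if !p.Healthy || p.HostNode == n || len(placed[p.Name]) >= p.Capacity {
				continue
			}
			if best == -1 || len(placed[p.Name]) < len(placed[pods[best].Name]) {
				best = i
			}
		}
		if best == -1 {
			orphans = append(orphans, n)
			continue
		}
		name := pods[best].Name
		placed[name] = append(placed[name], n)
	}
	return placed, orphans
}

func main() {
	pods := []Pod{
		{Name: "console-node-0", HostNode: "w1", Capacity: 2, Healthy: true},
		{Name: "console-node-1", HostNode: "w2", Capacity: 2, Healthy: true},
	}
	placed, orphans := assign([]string{"w1", "w2", "c1"}, pods)
	fmt.Println(placed, orphans)
}
```

In this sketch, a worker whose only eligible pod is full or down ends up in the orphan list, which a caller could retry once pod status changes, rather than being assigned and dropped in a loop.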

Issues and Related PRs

Testing

Tested on:

  • Surtur

Test description:

I installed the new versions of all three console services and monitored node acquisition through the pod rollout. I then forced pod failures to ensure the remaining pods picked up the orphaned nodes, and confirmed that the nodes were correctly rebalanced once all console-node pods were back up and running.

  • Were the install/upgrade-based validation checks/tests run (goss tests/install-validation doc)? N
  • Were continuous integration tests run? If not, why? N
  • Was upgrade tested? If not, why? N
  • Was downgrade tested? If not, why? N
  • Were new tests (or test issues/Jiras) created for this change? N

Risks and Mitigations

This is a medium-sized change, but I spent several days testing every situation I could think of.

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable

@dlaine-hpe dlaine-hpe requested a review from a team as a code owner December 19, 2024 17:33
Review comment on console_data_svc/datastore.go (outdated, resolved)
@dlaine-hpe dlaine-hpe merged commit 068fbf0 into develop Dec 19, 2024
4 checks passed
@dlaine-hpe dlaine-hpe deleted the CASMTRIAGE-7594 branch December 19, 2024 21:44
2 participants