CASMTRIAGE-7594 - better resiliency. #59

dlaine-hpe · 2024-12-19T17:33:35Z

Summary and Scope

A worker node's console could end up monitored by no one. The node kept being assigned to, and then unassigned from, the console-node pod running on that same worker, and the other pod had no capacity to pick it up. As a result the node was picked up and dropped continually, repeatedly forcing the conmand process to be killed, which left incomplete logs for all the nodes in that pod.

The fix reworks node acquisition and rebalancing to be much more stable and aware of the current status of the services. It requires changes in all three console repos.
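To illustrate the kind of status-aware rebalancing described above, here is a minimal Go sketch. It is not the code from this PR. The `Pod` type, the `Rebalance` function, the node and pod names, and the rule that a pod never monitors the worker it runs on are all assumptions made for the example. The sketch keeps existing valid assignments where it can (stability), treats the per-pod share as a soft limit so a node with only one eligible pod is not left bouncing, and reassigns nodes owned by pods that are down.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical view of one console-node pod (not a type from the repo).
type Pod struct {
	Name     string
	HostNode string // worker node the pod runs on
	Alive    bool
}

// Rebalance returns a node -> pod assignment. Assumption: a pod must not
// monitor the worker it runs on. Nodes with no eligible pod stay unassigned.
func Rebalance(nodes []string, pods []Pod, current map[string]string) map[string]string {
	alive := map[string]Pod{}
	hosts := map[string]bool{}
	var names []string
	for _, p := range pods {
		if p.Alive {
			alive[p.Name] = p
			hosts[p.HostNode] = true
			names = append(names, p.Name)
		}
	}
	sort.Strings(names)
	result := map[string]string{}
	if len(names) == 0 {
		return result
	}
	// Soft per-pod share: ceil(nodes / live pods).
	limit := (len(nodes) + len(names) - 1) / len(names)

	// Handle the most constrained nodes (those hosting a live pod) first.
	ordered := append([]string(nil), nodes...)
	sort.Slice(ordered, func(i, j int) bool {
		if hosts[ordered[i]] != hosts[ordered[j]] {
			return hosts[ordered[i]]
		}
		return ordered[i] < ordered[j]
	})

	load := map[string]int{}
	var pending []string
	// Pass 1: keep existing assignments that are still valid and within the share.
	for _, n := range ordered {
		owner := current[n]
		p, up := alive[owner]
		if up && p.HostNode != n && load[owner] < limit {
			result[n] = owner
			load[owner]++
		} else {
			pending = append(pending, n)
		}
	}
	// Pass 2: give each remaining node to the least-loaded eligible pod.
	// The share is not enforced here, so a node is never dropped for capacity.
	for _, n := range pending {
		best := ""
		for _, name := range names {
			if alive[name].HostNode == n {
				continue
			}
			if best == "" || load[name] < load[best] {
				best = name
			}
		}
		if best != "" {
			result[n] = best
			load[best]++
		}
	}
	return result
}

func main() {
	nodes := []string{"w1", "w2", "n3", "n4"}
	pods := []Pod{
		{Name: "pod-a", HostNode: "w1", Alive: true},
		{Name: "pod-b", HostNode: "w2", Alive: true},
	}
	assigned := Rebalance(nodes, pods, nil)
	fmt.Println("initial:", assigned)

	pods[1].Alive = false // simulate a pod failure
	assigned = Rebalance(nodes, pods, assigned)
	fmt.Println("pod-b down:", assigned)

	pods[1].Alive = true // pod comes back, nodes are rebalanced
	assigned = Rebalance(nodes, pods, assigned)
	fmt.Println("pod-b back:", assigned)
}
```

In this example, while pod-b is down the surviving pod picks up its orphaned nodes, except w1, which stays unassigned because the only live pod runs on it. Once pod-b returns, each pod ends up with two nodes and untouched assignments are left where they were.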

Issues and Related PRs

Resolves CASMTRIAGE-7594

Testing

Tested on:

Surtur

Test description:

I installed the new versions of all three console services and monitored node acquisition through the pod rollout. I then forced pod failures to ensure the remaining pods picked up the orphaned nodes, and confirmed that the nodes were correctly rebalanced once all console-node pods were back up and running.

Were the install/upgrade-based validation checks/tests run (goss tests/install-validation doc)? N
Were continuous integration tests run? If not, why? N
Was upgrade tested? If not, why? N
Was downgrade tested? If not, why? N
Were new tests (or test issues/Jiras) created for this change? N

Risks and Mitigations

This is a medium-sized change, but I spent several days testing every situation I could think of.

Pull Request Checklist

Version number(s) incremented, if applicable
Copyrights updated
License file intact
Target branch correct
CHANGELOG.md updated
Testing is appropriate and complete, if applicable

dlaine-hpe requested a review from a team as a code owner December 19, 2024 17:33

jsollom-hpe approved these changes Dec 19, 2024

console_data_svc/datastore.go (outdated, resolved)

CASMTRIAGE-7594 - better resiliency.

5999c03

dlaine-hpe force-pushed the CASMTRIAGE-7594 branch from 3f6b893 to 5999c03 Compare December 19, 2024 21:10

dlaine-hpe merged commit 068fbf0 into develop Dec 19, 2024
4 checks passed

dlaine-hpe deleted the CASMTRIAGE-7594 branch December 19, 2024 21:44
