CASMTRIAGE-7594 - better resiliency. #59
Merged
Summary and Scope
A worker node's console was going unmonitored: the node kept being assigned to, and then unassigned from, the console-node pod running on that same worker, and the other pod did not have the capacity to pick it up. As a result, the node was picked up and dropped continually, which forced the conmand process to be killed each time and left logging incomplete for all of the nodes handled by that pod.
The fix reworks node acquisition and rebalancing to be much more stable and aware of the current status of the services. It requires changes in all three console repos.
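As a rough illustration of what "aware of the current status of the services" could mean for rebalancing, here is a minimal sketch, not the code in this PR. It assumes each pod knows the total node count, the number of active console-node pods, and how much spare capacity the other pods report. The names `target_share` and `nodes_to_release` are hypothetical. The idea is that a pod only gives up nodes that another pod can actually pick up, which avoids the pick-up/drop loop described above.

```python
import math


def target_share(total_nodes: int, active_pods: int) -> int:
    """Even share of nodes per pod, rounded up (hypothetical helper)."""
    if active_pods <= 0:
        raise ValueError("at least one active pod is required")
    return math.ceil(total_nodes / active_pods)


def nodes_to_release(held: int, total_nodes: int, active_pods: int,
                     spare_capacity_elsewhere: int) -> int:
    """How many nodes this pod should release during a rebalance.

    Assumption: a pod never releases more nodes than the other pods
    have room for, so a released node is never left unmonitored.
    """
    excess = max(0, held - target_share(total_nodes, active_pods))
    return min(excess, max(0, spare_capacity_elsewhere))
```

With 10 nodes across 3 active pods the even share is 4, so a pod holding 6 nodes would release 2 if the other pods have room, and none if they report no spare capacity.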
Issues and Related PRs
Testing
Tested on:
Surtur
Test description:
I installed the new versions of all three console services and monitored node acquisition through the pod rollout. I then forced pod failures to ensure that the remaining pods picked up the orphaned nodes, and confirmed that the nodes were correctly rebalanced once all console-node pods were back up and running.
Risks and Mitigations
This is a medium-sized change, but I spent several days testing every situation I could think of.
Pull Request Checklist