Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASMHMS-6299: Fix PCS/TRS resource leaks and scaling issues #11

Merged
merged 605 commits into from
Nov 25, 2024
Merged

Conversation

jwlv
Copy link
Contributor

@jwlv jwlv commented Nov 22, 2024

Summary and Scope

A variety of changes, mostly in support of PCS experiencing scaling issues in the field. Details are as follows:

  • TRS now supports configuration of connection counts and timeouts by callers
  • TRS no longer closes all idle connections when http or contexts time out
  • TRS no longer closes all idle connections when request retry limits are reached
  • Reworked several sections of code for clarity and reduced code duplication
  • Fixed bug where contexts were never being cancelled which lead to resource leaks
  • Fixed bug to prevent 2nd request if 1st request's context timed out or canceled
  • Additional tracing added for debug purposes
  • Unit tests: Now run in verbose mode so failures are more easily analyzed
  • Unit tests: Enabled TRS logging from inside unit tests
  • Unit tests: Error signature changed to make identifying errors easier
  • Unit tests: Reworked some existing unit tests
  • Unit tests: Numerous unit tests added to test connection states
  • Update required version of Go to 1.23 to avoid https://github.com/golang/go/issues/59017

One thing to note about adding support for connections pools, this only provides the capability on the client side, for users of TRS. Remote servers (or intermediate firewalls, proxies, etc) may not be configured, or desire, to maintain open idle connections for a period of time. We discovered that while some BMCs kept connections open, others didn't.

The PCS PR needing these changes: Cray-HPE/hms-power-control#57

The PCS helm chart PR: Cray-HPE/hms-power-control-charts#38

Issues and Related PRs

Testing

Test description:

Testing was two-pronged, unit testing and testing on a live system. The majority of testing was done through the unit test framework. Extensive unit tests were added for all aspects of the connection lifecycle. Many of the tests are not run in the build pipeline because the build pipeline limits the total unit test time to 10 minutes. Developers making changes in the future should (a) enable/disable each test so that it can confirm no regressions, or (b) and probably more appropriately, move the connection unit tests into the integration test framework in the build pipeline.

These changes were also tested alongside PCS changes on both mug and rocket. Rocket is a CSM 1.5 system and mug is a CSM 1.6 system. Additionally, mug was running several thousand simulated BMCs on compute nodes so that the scaling aspects of these changes could be tested. All standard PCS operations were confirmed to continue working.

While we do not have any systems internally that are able to reproduce all of the problems at customer sites, we believe this change will go a long ways towards improving most of the issues being experienced.

Testing Check List:

  • Were the install/upgrade-based validation checks/tests run (goss tests/install-validation doc)? N (not deployable on its own)
  • Were continuous integration tests run? If not, why? Y
  • Was upgrade tested? If not, why? N (not deployable on its own)
  • Was downgrade tested? If not, why? N (not deployable on its own)

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable

@jwlv jwlv requested review from a team as code owners November 22, 2024 02:49
@jwlv jwlv requested review from mbuchmann-hpe and removed request for jacobsalmela November 22, 2024 03:43
@jwlv jwlv merged commit ffebdab into master Nov 25, 2024
5 checks passed
@jwlv jwlv deleted the CASMHMS-6299 branch November 25, 2024 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants