Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASMHMS-6299: Fix PCS/TRS resource leaks and scaling issues #57

Merged
merged 52 commits into from
Nov 25, 2024

Conversation

jwlv
Copy link
Contributor

@jwlv jwlv commented Nov 22, 2024

Summary and Scope

A variety of changes related to PCS scaling issues being experienced in the field. Some details are as follows:

  • Updated hms-trs-app-api vendor code (bug fixes and enhancements)
  • Passed PCS's log level through to TRS to match PCS's
  • Configured TRS to use connection pools for status requests to BMCs
  • Renamed PCS_STATUS_HTTP_TIMEOUT to PCS_STATUS_TIMEOUT as it is not an http timeout
  • Added PCS_MAX_IDLE_CONNS and PCS_MAX_IDLE_CONNS_PER_HOST env variables which allow overriding PCS's connection pool settings in TRS
  • Added PCS_BASE_TRS_TASK_TIMEOUT env variable which allows the timeout for power transitions and capping to be configured
  • The above variables are configurable on PCS's helm chart
  • At PCS start time, log all env variables that were set
  • Added PodName global to facilitate easier debug of log messages
  • Log start and end of large batched requests to BMCs
  • Fixed many resource leaks associated with making http requests and using TRS
  • Update required version of Go to 1.23 to avoid https://github.com/golang/go/issues/59017

One thing to note about adding support for connections pools, this only provides the capability on the client side, for users of TRS. Remote servers (or intermediate firewalls, proxies, etc) may not be configured, or desire, to maintain open idle connections for a period of time. We discovered that while some BMCs kept connections open, others didn't.

The TRS PR these changes depends on: Cray-HPE/hms-trs-app-api#11

The PCS helm chart PR: Cray-HPE/hms-power-control-charts#38

Issues and Related PRs

Testing

Test description:

Testing was two-pronged, unit testing (in TRS) and testing on a live system. The majority of testing was done through the TRS unit test framework. Extensive unit tests were added for all aspects of the connection lifecycle.

These changes were also tested alongside TRS changes on both mug and rocket. Rocket is a CSM 1.5 system and mug is a CSM 1.6 system. Additionally, mug was running several thousand simulated BMCs on compute nodes so that the scaling aspects of these changes could be tested. All standard PCS operations were confirmed to continue working.

While we do not have any systems internally that are able to reproduce all of the problems at customer sites, we believe this change will go a long ways towards improving most of the issues being experienced.

Testing Check List:

  • Were the install/upgrade-based validation checks/tests run (goss tests/install-validation doc)? Y
  • Were continuous integration tests run? If not, why? Y
  • Was upgrade tested? If not, why? Y
  • Was downgrade tested? If not, why? Y

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable

@jwlv jwlv requested review from a team as code owners November 22, 2024 03:28
Copy link

github-actions bot commented Nov 22, 2024

👋 Hey! Here is the image we built for you (Artifactory Link):

artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b

Use podman or docker to pull it down and inspect locally:

podman pull artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b

Or, use this script to pull the image from the build server to a dev system:

Dev System Pull Script

Note the following script only applies to systems running CSM 1.2 or later.

#!/usr/bin/env bash

IMAGE=artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b

podman run --rm --network host  \
    quay.io/skopeo/stable copy \
    --src-tls-verify=false \
    --dest-tls-verify=false \
    --dest-username "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.username | @base64d')" \
    --dest-password "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.password | @base64d')" \
    docker://$IMAGE \
    docker://registry.local/$IMAGE
Snyk Report

Coming soon

Software Bill of Materials
cosign download sbom artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b > container_image.spdx

If you don't have cosign, then you can get it here.

Note: this SHA is the merge of 42fa19e and the PR base branch. Good luck and make rocket go now! 🌮 🚀

Copy link

github-actions bot commented Nov 22, 2024

👋 Hey! Here is the image we built for you (Artifactory Link):

artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b

Use podman or docker to pull it down and inspect locally:

podman pull artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b

Or, use this script to pull the image from the build server to a dev system:

Dev System Pull Script

Note the following script only applies to systems running CSM 1.2 or later.

#!/usr/bin/env bash

IMAGE=artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b

podman run --rm --network host  \
    quay.io/skopeo/stable copy \
    --src-tls-verify=false \
    --dest-tls-verify=false \
    --dest-username "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.username | @base64d')" \
    --dest-password "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.password | @base64d')" \
    docker://$IMAGE \
    docker://registry.local/$IMAGE
Snyk Report

Coming soon

Software Bill of Materials
cosign download sbom artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b > container_image.spdx

If you don't have cosign, then you can get it here.

Note: this SHA is the merge of 42fa19e and the PR base branch. Good luck and make rocket go now! 🌮 🚀

jwlv added 21 commits November 24, 2024 07:50
  for power transitions and capping to be configured
- At PCS start time, log all env variables that were set
- Log start and end of large batched requests to BMCs
Also required updating go.mod dependencies
…after

we process responses.  This frees up resources as soon as they are no
longer needed.

Also, fixed a missing response body drain/close.
- New version of TRS that allows retry counts of 0
- Attempt lower comp query number to plug memory leak
- Attempt HTTP2 in TRS to plug memory leak
- Both prior will be reverted if they don't fix last leak
    Add more pod name logging
    New TRS to revert HTTP2 force - did not work
Dockerfile Outdated Show resolved Hide resolved
Dockerfile.integration.Dockerfile Outdated Show resolved Hide resolved
Dockerfile.test.unit.Dockerfile Outdated Show resolved Hide resolved
@jwlv jwlv merged commit 5ddb830 into master Nov 25, 2024
17 checks passed
@jwlv jwlv deleted the CASMHMS-6299.with_trs branch November 25, 2024 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants