CASMHMS-6299: Fix PCS/TRS resource leaks and scaling issues #57
Conversation
- Pass log level to TRS
- Update to new TRS
- Massive plug of resource leak: response bodies not getting closed

optimizing builds, but since Go 1.10 the go build system uses automatic caching to handle what -i did.
👋 Hey! Here is the image we built for you (Artifactory Link): artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b

Use podman or docker to pull it down and inspect locally:

podman pull artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b

Or, use this script to pull the image from the build server to a dev system:

#!/usr/bin/env bash
IMAGE=artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b
podman run --rm --network host \
    quay.io/skopeo/stable copy \
    --src-tls-verify=false \
    --dest-tls-verify=false \
    --dest-username "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.username | @base64d')" \
    --dest-password "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.password | @base64d')" \
    docker://$IMAGE \
    docker://registry.local/$IMAGE

Snyk Report: Coming soon

Software Bill of Materials:
cosign download sbom artifactory.algol60.net/csm-docker/unstable/cray-power-control-hmth-test:2.6.0-20241125214211.a0f8a6b > container_image.spdx
If you don't have cosign, then you can get it here.

Note: this SHA is the merge of 42fa19e and the PR base branch.

Good luck and make rocket go now! 🌮 🚀
👋 Hey! Here is the image we built for you (Artifactory Link): artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b

Use podman or docker to pull it down and inspect locally:

podman pull artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b

Or, use this script to pull the image from the build server to a dev system:

#!/usr/bin/env bash
IMAGE=artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b
podman run --rm --network host \
    quay.io/skopeo/stable copy \
    --src-tls-verify=false \
    --dest-tls-verify=false \
    --dest-username "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.username | @base64d')" \
    --dest-password "$(kubectl -n nexus get secret nexus-admin-credential -o json | jq -r '.data.password | @base64d')" \
    docker://$IMAGE \
    docker://registry.local/$IMAGE

Snyk Report: Coming soon

Software Bill of Materials:
cosign download sbom artifactory.algol60.net/csm-docker/unstable/cray-power-control:2.6.0-20241125214218.a0f8a6b > container_image.spdx
If you don't have cosign, then you can get it here.

Note: this SHA is the merge of 42fa19e and the PR base branch.

Good luck and make rocket go now! 🌮 🚀
for power transitions and capping to be configured
- At PCS start time, log all env variables that were set
- Log start and end of large batched requests to BMCs
Also required updating go.mod dependencies
…after we process responses. This frees up resources as soon as they are no longer needed. Also, fixed a missing response body drain/close.
- New version of TRS that allows retry counts of 0
- Attempt lower comp query number to plug memory leak
- Attempt HTTP2 in TRS to plug memory leak
- Both prior will be reverted if they don't fix last leak

Add more pod name logging
New TRS to revert HTTP2 force - did not work
Summary and Scope
A variety of changes related to PCS scaling issues being experienced in the field. Some details are as follows:
One thing to note about adding connection pool support: it only provides the capability on the client side, for users of TRS. Remote servers (or intermediate firewalls, proxies, etc.) may not be configured to maintain open idle connections for a period of time, or may not want to. We discovered that while some BMCs kept connections open, others did not.
The TRS PR these changes depend on: Cray-HPE/hms-trs-app-api#11
The PCS helm chart PR: Cray-HPE/hms-power-control-charts#38
Issues and Related PRs
Testing
Test description:
Testing was two-pronged, unit testing (in TRS) and testing on a live system. The majority of testing was done through the TRS unit test framework. Extensive unit tests were added for all aspects of the connection lifecycle.
These changes were also tested alongside the TRS changes on both mug and rocket. Rocket is a CSM 1.5 system and mug is a CSM 1.6 system. Additionally, mug was running several thousand simulated BMCs on compute nodes so that the scaling aspects of these changes could be tested. All standard PCS operations were confirmed to continue working. While we do not have any systems internally that can reproduce all of the problems at customer sites, we believe this change will go a long way toward improving most of the issues being experienced.
Testing Check List:
Pull Request Checklist