Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASMCMS-9225: Added basic paging ability for GET requests to list components #396

Merged
merged 7 commits into from
Dec 17, 2024

Conversation

mharding-hpe
Copy link
Contributor

@mharding-hpe mharding-hpe commented Dec 17, 2024

CASMCMS-9225 as a whole involves changes to a number of different files in BOS. In order to aid with review, I'm breaking the overall thing up into smaller PRs. Each will be built on top of each other, and will be merging into the main PR branch. Only once each sub-PR has been approved and merged will I merge that branch into develop.
Full list of sub-PRs:

  1. Added basic paging ability for GET requests to list components
  2. Create generic endpoint classes
  3. Create generic API client class
  4. Create ApiClients class and provide it to BOS operators inside a context manager
  5. Move PCS client to new paradigm
  6. Move BSS client to new paradigm
  7. Move CFS client to new paradigm
  8. Move BOS client to new paradigm
  9. Move IMS client to new paradigm
  10. Move HSM client to new paradigm
  11. CHANGELOG update, linting, update utils

CASMCMS-9225 overall is addressing problems I found when investigating possible resource leaks in BOS (related to Josh Williamson's memory leak investigation with PCS). In the past we've made other changes to BOS to ensure that its requests to services like PCS and CFS do not get excessively large (this was done mostly by adding the max_component_batch_size option). However, one thing that was not addressed by this was when the BOS operators poll BOS itself to figure out which components require action. In those cases, the size of the request responses can end up scaling linearly with the number of nodes in the system, and on mug I observed that this could lead to two problems:

  1. These requests were more likely to encounter timeouts, because it took the BOS server longer to assemble the very large responses.
  2. I observed evidence that any time BOS encountered a timeout, there was a chance of an apparent memory leak, and the size of the leaked memory appeared to be related to the size of the response it was receiving. Thus, timeouts on these very large responses led to large memory leaks, eventually resulting in OOMKills.

This PR makes the following changes:

  1. Adds two new options that can be specified when listing BOS components (max_page_size and start_after_id). These enable paging when listing BOS components. They are both optional, and if they are not specified, BOS works exactly as before. Thus, they are entirely backward-compatible. My current plan is not to add these to the Cray CLI -- their main purpose is for BOS internal use, although of course anyone using the API is free to use them. If there is demand, it would not be difficult to add them into the CLI, however.
  2. The code used to list BOS components had minor refactoring, to make it more efficient when iterating through the BOS components database, filtering out components and composing its responses. I did this because I discovered that even with paging, BOS could still be quite slow when responding to some queries, particularly ones where very few components matched the filters. In that case, BOS still had to go through its entire components database before providing a response. The refactoring made here led to a 3-4x speed improvement for such requests.

This PR does NOT yet modify the BOS operators to use these new abilities. That will be in a separate PR, because that code ended up being refactored for other reasons.

@mharding-hpe mharding-hpe requested a review from a team as a code owner December 17, 2024 14:33
@mharding-hpe mharding-hpe force-pushed the casmcms-9225-01-paging branch 2 times, most recently from 586cfa3 to 24bdb6b Compare December 17, 2024 14:47
@mharding-hpe mharding-hpe force-pushed the casmcms-9225-01-paging branch from 24bdb6b to 2b181ad Compare December 17, 2024 15:08
Copy link

@kumarrahul04 kumarrahul04 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments added.

src/bos/server/controllers/v2/components.py Outdated Show resolved Hide resolved
src/bos/server/controllers/v2/components.py Outdated Show resolved Hide resolved
src/bos/server/controllers/v2/components.py Outdated Show resolved Hide resolved
src/bos/server/controllers/v2/components.py Outdated Show resolved Hide resolved
src/bos/server/controllers/v2/components.py Show resolved Hide resolved
src/bos/server/redis_db_utils.py Show resolved Hide resolved
src/bos/server/redis_db_utils.py Show resolved Hide resolved
src/bos/server/redis_db_utils.py Show resolved Hide resolved
src/bos/server/redis_db_utils.py Show resolved Hide resolved
src/bos/server/redis_db_utils.py Outdated Show resolved Hide resolved
src/bos/server/redis_db_utils.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants