Fix check for nvidia-device-plugin-daemonset when deploying NVIDIA operator stack #1871

Merged
merged 8 commits into from
Oct 2, 2024

Contributor

@bdattoma bdattoma commented Sep 30, 2024

Currently, when deploying the NVIDIA GPU operator stack, the nvidia-device-plugin-daemonset pod gets restarted after being in "init" status. By that point our script has already started waiting for the pod to reach Ready status, so it keeps waiting for a pod that will never be up and running. In addition, the wait has a 20-minute timeout, which makes CI spend more time than needed.

Solution:

  • instead of waiting 20 minutes after fetching the pod name, the script waits only 10 seconds; if the pod is not ready, it continues the loop (i.e., fetches the pod name again, waits 10 seconds, and so on)
  • the script waits for the GPU node to be ready before deploying the NVIDIA stack, which reduces the risk of hitting the timeouts
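The retry loop from the first bullet can be sketched roughly as follows. This is a minimal illustration in Python, not the script changed by this PR: the function `wait_for_ready_pod`, its callbacks `get_pod_name` and `wait_ready`, and the overall 20-minute bound are all hypothetical names and assumptions introduced here. The point it shows is that the pod name is resolved again on every attempt, so a pod that was replaced after its "init" phase is picked up on the next pass.

```python
import time
from typing import Callable, Optional


def wait_for_ready_pod(
    get_pod_name: Callable[[], Optional[str]],
    wait_ready: Callable[[str, float], bool],
    per_attempt_timeout: float = 10.0,
    overall_timeout: float = 1200.0,  # assumption: an outer bound is still kept
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the name of the first pod seen Ready.

    get_pod_name: hypothetical callback returning the current pod name,
        or None if no pod exists yet.
    wait_ready: hypothetical callback that waits up to `timeout` seconds
        for the given pod and returns True if it became Ready.
    """
    deadline = time.monotonic() + overall_timeout
    while time.monotonic() < deadline:
        # Fetch the name again on every pass: the previous pod may be gone.
        pod = get_pod_name()
        if pod is None:
            # Nothing to wait on yet; pause before looking again.
            sleep(per_attempt_timeout)
            continue
        # Short wait only, so a stale pod costs seconds rather than minutes.
        if wait_ready(pod, per_attempt_timeout):
            return pod
    raise TimeoutError("no pod became Ready within the overall timeout")
```

In a real setup the two callbacks would wrap cluster queries, for example a pod lookup by label and something like `oc wait --for=condition=Ready --timeout=10s` on the returned pod. The same loop shape could also serve the second bullet, with callbacks that check the GPU node instead of the pod.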

PR validation:

  1. provision a smaller GPU on IBM Cloud: rhods-ci-pr-test/3368 PASS - the job failure at the end is not related to this PR. It took 15 minutes from provisioning to stack deployment
  2. provision a bigger GPU on GCP: rhods-ci-pr-test/3371 - approx. 15 minutes for the e2e flow (provisioning + operator installation)
  3. test provisioning without this PR and compare timing: rhods-ci-pr-test/3372 PASS - it took 28 minutes for the e2e flow (provisioning + operator installation) - the job failure at the end is not related to this PR

Contributor

github-actions bot commented Sep 30, 2024

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % |
| 547 | 0 | 0 | 547 | 100 |

@bdattoma bdattoma self-assigned this Sep 30, 2024
@bdattoma bdattoma added the labels needs testing (Needs to be tested in Jenkins), enhancements (Bugfixes, enhancements, refactoring, ... in tests or libraries; PR will be listed in release-notes), do not merge (Do not merge this yet please), and verified (This PR has been tested with Jenkins), and removed the labels needs testing and do not merge, Sep 30, 2024
@bdattoma bdattoma requested a review from kobihk October 1, 2024 16:32
sonarqubecloud bot commented Oct 2, 2024

@bdattoma bdattoma merged commit bf9fb11 into red-hat-data-services:master Oct 2, 2024
8 checks passed
Labels
enhancements Bugfixes, enhancements, refactoring, ... in tests or libraries (PR will be listed in release-notes) verified This PR has been tested with Jenkins

3 participants