feat(nodeResources): add GPU support #1708

Open
wants to merge 1 commit into
base: main

@DexterYan DexterYan commented Dec 19, 2024

Description, Motivation and Context

  • Add resourceName and resourceAllocatable filters to the nodeResources analyzer so nodes can be filtered by GPU resource
  • Add tests

ADR doc: https://docs.google.com/document/d/1LXuhzjzSsvuoOo4CnUXeq9SqfMYVPA0GpTNJj6PcX3g/edit?tab=t.0
sc-106618

Demo Yaml

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - clusterResources: {}
  analyzers:
    - nodeResources:
        filters:
          resourceName: nvidia.com/gpu
        checkName: Must have at least 1 GPU-enabled node in the cluster
        outcomes:
          - pass:
              when: "count() >= 1"
              message: "This application requires at least 1 GPU-enabled node"
---
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - clusterResources: {}
  analyzers:
    - nodeResources:
        filters:
          resourceName: nvidia.com/gpu
        checkName: Must have at least 1 GPU-enabled node in the cluster
        outcomes:
          - pass:
              when: "min(resourceAllocatable) = 1"
              message: "This application requires at least 1 GPU-enabled node"
[Screenshot: 2024-12-20 at 11:33:07 AM]

Fixes: #1162

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@DexterYan DexterYan added the type::feature New feature or request label Dec 19, 2024
@DexterYan DexterYan requested a review from a team as a code owner December 19, 2024 05:26
@@ -417,6 +417,10 @@ spec:
    type: string
  podCapacity:
    type: string
  resourceAllocatable:

resourceCapacity is missing

@@ -382,6 +394,26 @@ func nodeMatchesFilters(node corev1.Node, filters *troubleshootv1beta2.NodeResou
		return true, nil
	}

	if filters.ResourceName != "" {
		if filters.ResourceAllocatable != "" {

- nodeResources:
    filters:
      resourceName: gpu.intel.com/i915

With the spec above, a node that does not expose gpu.intel.com/i915 is still matched by this code and counted as a node with the Intel GPU.

We need to add a check here that node.Status.Allocatable["gpu.intel.com/i915"] or node.Status.Capacity["gpu.intel.com/i915"] exists.
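
A minimal sketch of the existence check requested in this comment, not the PR's actual code. The `node` type is a simplified stand-in for `corev1.Node` (quantities reduced to plain integers), and `hasResource`, `matchesResourceFilter`, and `countMatching` are hypothetical helper names introduced only for illustration.

```go
package main

import "fmt"

// node is a simplified stand-in for corev1.Node (assumption): only the two
// resource maps that matter here, with quantities reduced to plain integers.
type node struct {
	name        string
	allocatable map[string]int64
	capacity    map[string]int64
}

// hasResource reports whether the node advertises the named resource in
// either its allocatable or its capacity map. Indexing a nil map is safe in Go.
func hasResource(n node, resourceName string) bool {
	if _, ok := n.allocatable[resourceName]; ok {
		return true
	}
	_, ok := n.capacity[resourceName]
	return ok
}

// matchesResourceFilter is a hypothetical helper: an empty resourceName means
// "no resource filter"; otherwise the node must expose that resource.
func matchesResourceFilter(n node, resourceName string) bool {
	if resourceName == "" {
		return true
	}
	return hasResource(n, resourceName)
}

// countMatching counts the nodes that pass the resource filter.
func countMatching(nodes []node, resourceName string) int {
	count := 0
	for _, n := range nodes {
		if matchesResourceFilter(n, resourceName) {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{name: "gpu-node", allocatable: map[string]int64{"gpu.intel.com/i915": 1}},
		{name: "cpu-node", allocatable: map[string]int64{"cpu": 4}},
	}
	// Only gpu-node exposes the resource, so this prints 1.
	fmt.Println(countMatching(nodes, "gpu.intel.com/i915"))
}
```

The two-node fixture in `main` is also the shape a unit test for this case could take: one node with the resource, one without, and an expected count of 1.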


Please add a unit test for this use case

Comment on lines +196 to +198
	if filters != nil && filters.ResourceName != "" {
		resourceName = filters.ResourceName
	}

Suggested change
-	if filters != nil && filters.ResourceName != "" {
-		resourceName = filters.ResourceName
-	}
+	if filters != nil {
+		resourceName = filters.ResourceName
+	}
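
A small sketch of why the shorter condition can behave the same as the original one. It assumes `resourceName` starts out as the empty string before these lines run, which the excerpt does not show; if it has a non-empty default, the two forms differ. The `filters` struct and both function names are hypothetical stand-ins, not the analyzer's real types.

```go
package main

import "fmt"

// filters is a hypothetical, trimmed-down stand-in for the analyzer's filter
// struct; only the field under discussion is modelled.
type filters struct {
	ResourceName string
}

// resourceNameOriginal mirrors the original three lines.
// Assumption: resourceName defaults to "" before the check.
func resourceNameOriginal(f *filters) string {
	resourceName := ""
	if f != nil && f.ResourceName != "" {
		resourceName = f.ResourceName
	}
	return resourceName
}

// resourceNameSuggested mirrors the suggested change. The nil check stays
// because reading a field through a nil pointer would panic.
func resourceNameSuggested(f *filters) string {
	resourceName := ""
	if f != nil {
		resourceName = f.ResourceName
	}
	return resourceName
}

func main() {
	cases := []*filters{nil, {}, {ResourceName: "nvidia.com/gpu"}}
	for _, f := range cases {
		// Under the empty-default assumption both forms agree on every case.
		fmt.Println(resourceNameOriginal(f) == resourceNameSuggested(f))
	}
}
```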

		totalNodeCount: len(nodeData),
		expected:       true,
		isError:        false,
	},

nit: For completeness, I'd add a unit test for sum as well
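
One way such a sum case could look, as a rough sketch only. It assumes the test would exercise a sum over the allocatable amount of the filtered resource; `sumAllocatable` is a hypothetical helper, and each map is a simplified stand-in for a node's `status.allocatable` with quantities reduced to whole integers.

```go
package main

import "fmt"

// sumAllocatable adds up the allocatable amount of resourceName across nodes.
// Hypothetical helper: nodes that lack the resource contribute 0, because
// indexing a missing key in a Go map yields the zero value.
func sumAllocatable(nodes []map[string]int64, resourceName string) int64 {
	var total int64
	for _, allocatable := range nodes {
		total += allocatable[resourceName]
	}
	return total
}

func main() {
	nodes := []map[string]int64{
		{"nvidia.com/gpu": 2, "cpu": 8},
		{"cpu": 4}, // no GPU on this node
		{"nvidia.com/gpu": 1},
	}
	// Table-driven cases in the style of the existing tests.
	cases := []struct {
		resourceName string
		want         int64
	}{
		{"nvidia.com/gpu", 3},
		{"gpu.intel.com/i915", 0},
	}
	for _, c := range cases {
		got := sumAllocatable(nodes, c.resourceName)
		fmt.Printf("%s: got=%d want=%d ok=%t\n", c.resourceName, got, c.want, got == c.want)
	}
}
```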

Development

Successfully merging this pull request may close these issues.

Feature: add GPU capabilities to nodeResources analyzer