
Feature: add GPU capabilities to nodeResources analyzer #1162

Open
adamancini opened this issue May 17, 2023 · 7 comments · May be fixed by #1708
Labels
type::feature New feature or request

Comments

@adamancini
Member

Describe the rationale for the suggested feature.

It would be good to be able to support preflights that want to check for GPU scheduling capability. Off-hand, I don't know if this is visible in node metadata, but maybe it could be detected from containerd configuration? This might require a new collector, or modifications to the nodeResources collector, to detect whether a node is capable of scheduling GPUs and to provide capacity/allocation similar to CPU, memory, and disk.

Describe the feature

Not sure exactly which fields would be required, or whether Allocatable makes sense, but at a minimum something like:

gpuCapacity - # of GPUs available to a node

so you can write expressions like

- nodeResources:
    checkName: Total GPU Cores in the cluster is 4 or greater
    outcomes:
      - fail:
          when: "sum(gpuCapacity) < 4"
          message: The cluster must contain at least 4 GPUs
      - pass:
          message: There are at least 4 GPUs
@adamancini adamancini added the type::feature New feature or request label May 17, 2023
@adamancini
Member Author

adamancini commented May 17, 2023

Thinking through this a little bit, there are a few places we can try to detect GPU support:

  1. containerd configuration
  2. nvidia-smi output
  3. node metadata
  4. run a no-op pod requesting GPUs and wait for successful exit

2: This can at least tell us if a GPU is installed, but not whether Kubernetes is configured to use it
3: I don't know if the information we need is exposed in node metadata; requires research
1, 4: I think these are the best options since they are the closest to a functional test confirming that GPU workloads can be scheduled (a rough pod sketch for option 4 is below)
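
As a rough sketch of option 4, a minimal no-op pod that is only scheduled if a GPU can be allocated might look like this (assuming the NVIDIA device plugin and its nvidia.com/gpu resource name; other vendors expose different resource names):

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-scheduling-check
  spec:
    restartPolicy: Never
    containers:
      - name: noop
        image: busybox
        command: ["true"]
        resources:
          limits:
            nvidia.com/gpu: 1   # vendor-specific extended resource name

If the pod reaches Succeeded, GPU workloads can be scheduled; if it stays Pending, they cannot.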

@diamonwiggins
Member

diamonwiggins commented May 17, 2023

Adding some thoughts from a discussion in Slack: on the node metadata angle, we may be able to determine from containerRuntimeVersion at least when the nvidia-container-runtime for containerd is being used. Not sure if that'll be robust enough, though; I imagine it could work for most cases.

from my local env:

    nodeInfo:
      architecture: amd64
      bootID: 81e20091-22da-4866-bfe4-a980057a1adf
      containerRuntimeVersion: containerd://1.5.9-k3s1
      kernelVersion: 5.15.49-linuxkit
      .....

@chris-sanders
Member

chris-sanders commented May 17, 2023

Just chiming in on the number-of-GPUs question. I think this is going to be implementation specific, and I don't know if we can measure it. I know the Intel GPU plugin can be configured to allow sharing GPUs or not, so the question isn't just how many GPUs are present but whether they are all fully scheduled.

I think we're going to have to be specific about the GPU drivers and providers to make any real attempt at this. Creating a pod seems like the most universal method, but it's going to require the user to define that pod. Again, with the Intel GPU driver there is no containerd configuration to review, and resource tracking happens via a resource line that requires the GPU device be listed explicitly.

Here's an example:

  resources:
    limits:
      gpu.intel.com/i915: 1

Here's an example of a node with the Intel GPU plugin. This node has both a Coral TPU and Intel GPUs available. It's not configured to allow GPU sharing, so I'm not sure if the allocatable number would change if that were enabled. You'll notice containerd has no special configs. The Coral TPU doesn't show up as a resource; it's just identified via a label from node-feature-discovery. It is a USB device, but I don't think that changes if it's an integrated device.

Name:               todoroki
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/coral-tpu=true
                    feature.node.kubernetes.io/intel-gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=todoroki
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node.k0sproject.io/role=control-plane
Annotations:        csi.volume.kubernetes.io/nodeid: {"smb.csi.k8s.io":"todoroki"}
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: coral-tpu,intel-gpu
                    nfd.node.kubernetes.io/master.version: v0.13.0
                    nfd.node.kubernetes.io/worker.version: v0.13.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...
Addresses:
  InternalIP: 
  Hostname:    todoroki
Capacity:
  cpu:                 8
  ephemeral-storage:   489580536Ki
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16231060Ki
  pods:                110
Allocatable:
  cpu:                 8
  ephemeral-storage:   451197421231
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16128660Ki
  pods:                110
System Info:
  Machine ID:                
  System UUID:                
  Boot ID:                 
  Kernel Version:             5.4.0-137-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.26.2+k0s
  Kube-Proxy Version:         v1.26.2+k0s
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests     Limits
  --------            --------     ------
  cpu                 1630m (20%)  2 (25%)
  memory              1072Mi (6%)  1936Mi (12%)
  ephemeral-storage   0 (0%)       0 (0%)
  hugepages-1Gi       0 (0%)       0 (0%)
  hugepages-2Mi       0 (0%)       0 (0%)
  gpu.intel.com/i915  1            1
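
If the nodeResources analyzer were extended to cover extended resources like this, a check would probably have to name the vendor resource explicitly. A purely hypothetical sketch (this selector syntax does not exist in the analyzer today):

  - nodeResources:
      checkName: At least one Intel GPU is allocatable
      outcomes:
        - fail:
            when: "sum(allocatable(gpu.intel.com/i915)) < 1"   # hypothetical selector
            message: No gpu.intel.com/i915 resources are allocatable in the cluster
        - pass:
            message: At least one Intel GPU is allocatable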

@DexterYan
Member

If those nodes are running in the cloud, we can use instance metadata to get GPU information. For example, AWS exposes

elastic-gpus/associations/elastic-gpu-id

However, for on-premise clusters, I think we may need to introduce a kURL add-on to install the different GPU device plugins. It would have to be pre-defined in the kURL installer.

@chris-sanders
Member

> However, for on-premise clusters, I think we may need to introduce a kURL add-on to install the different GPU device plugins. It would have to be pre-defined in the kURL installer.

I'm not sure what this part is referring to. This is about troubleshoot detecting the presence of GPUs, not about kURL installing drivers; that's out of scope for troubleshoot. How the drivers or GPU get set up is only relevant here as it pertains to detection. As long as troubleshoot has a way to detect a GPU, we don't particularly need to care how it got installed.

@diamonwiggins
Member

diamonwiggins commented May 19, 2023

After digging into this more and talking with some customers, I think @chris-sanders has landed on what will be the best approach here. We'd essentially have one or more collectors that do feature discovery similar to the projects below, and then let an analyzer work against the configuration collected. See:

https://github.com/kubernetes-sigs/node-feature-discovery
https://github.com/NVIDIA/gpu-feature-discovery

Edit: with that being said, I'm not sure whether we should start capturing this in a separate issue, since I'm not sure that what I'm describing makes sense in the nodeResources analyzer 🤔
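
For reference, these projects surface GPU capabilities as node labels that a collector could gather and an analyzer could then evaluate. The exact label set depends on the project and version, but it looks roughly like the following (values are illustrative):

  metadata:
    labels:
      # node-feature-discovery custom rule, as on the node shown earlier
      feature.node.kubernetes.io/intel-gpu: "true"
      # gpu-feature-discovery (NVIDIA); label names from that project, values illustrative
      nvidia.com/gpu.count: "2"
      nvidia.com/gpu.product: "Tesla-T4"
      nvidia.com/gpu.memory: "15360"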

@banjoh banjoh linked a pull request Dec 20, 2024 that will close this issue