Feature: add GPU capabilities to nodeResources analyzer #1162

Comments
Thinking through this a little bit, there are a few places we can try to detect GPU support:

2: This can at least tell us if a GPU is installed, but not whether Kubernetes is configured to use it.
Adding some thoughts from a discussion in Slack on the node metadata angle: we may be able to determine this from the node metadata in my local env.
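As an illustration of the kind of node metadata involved (this assumes a node running the NVIDIA device plugin; the resource name is vendor-specific, and this is not actual output from that env), a schedulable GPU typically shows up as an extended resource in the node's status:

```yaml
# Hypothetical node excerpt (assumes the NVIDIA device plugin; not actual output).
status:
  capacity:
    cpu: "8"
    memory: 32596556Ki
    nvidia.com/gpu: "1"       # extended resource registered by the device plugin
  allocatable:
    cpu: "8"
    memory: 32494156Ki
    nvidia.com/gpu: "1"
```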
Just chiming in on the number-of-GPUs question. I think this is going to be implementation specific, and I don't know if we can measure it. The Intel GPU Plugin, for example, can be configured to allow GPU sharing or not, so the question isn't just how many GPUs are present but whether they are all fully scheduled. I think we're going to have to be specific about the GPU drivers and providers to make any real attempt at this. Creating a pod seems like the most universal method, but it's going to require the user to define that pod. Again, using the Intel GPU driver, there is no containerd configuration to review, and the tracking of the resources is via a resource limit that requires the GPU driver be listed explicitly. Here's an example:

    resources:
      limits:
        gpu.intel.com/i915: 1

Here is an example of a node with the Intel GPU plugin. This node has both a Coral TPU and Intel GPUs available. It's not configured to allow GPU sharing, so I'm not sure if the allocatable number would change if that were enabled. You'll notice containerd has no special configs. The Coral TPU doesn't show up as a resource; it's just identified via a label from node-feature-discovery. It is a USB device, but I don't think that changes if it's an integrated device.
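As a hedged illustration of that shape (names below are examples, not copied from the node in question): the Intel GPU appears in allocatable under the device plugin's resource name, while the Coral TPU only appears as a node-feature-discovery label.

```yaml
# Illustrative only; label and resource names are examples, not from the node above.
metadata:
  labels:
    # node-feature-discovery label for the USB-attached Coral TPU
    feature.node.kubernetes.io/usb-fe_1a6e_089a.present: "true"
status:
  allocatable:
    cpu: "4"
    memory: 16012288Ki
    gpu.intel.com/i915: "1"   # registered by the Intel GPU device plugin
```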
If those nodes are running in the cloud, we can use instance metadata to get GPU information; AWS, for example, exposes the instance type through its instance metadata. However, for on-premises clusters, I think we may need to introduce a kURL add-on to add different GPU device plugins. It has to be pre-defined in the kURL installer.
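On Kubernetes, the instance type is also surfaced as a well-known node label, which a collector could read without calling the metadata service directly. A minimal sketch (the instance type shown is just an example of a GPU-bearing AWS type):

```yaml
# Illustrative node labels; p3.2xlarge is just an example GPU instance type.
metadata:
  labels:
    node.kubernetes.io/instance-type: p3.2xlarge
    topology.kubernetes.io/region: us-east-1
```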
I'm not sure what this part is referring to. This is about troubleshoot detecting the presence of GPUs, not about kURL installing drivers; that's out of scope for troubleshoot. How the drivers or GPU get set up is only relevant here as it pertains to detection. As long as troubleshoot has a way to detect a GPU, we don't particularly need to care how it got installed.
After digging into this more and talking with some customers, I think @chris-sanders has landed on what will be the best approach here. We'd essentially have one or more collectors that do feature discovery similar to the project below, and then let an analyzer run against the configuration collected. See: https://github.com/kubernetes-sigs/node-feature-discovery

Edit: With that being said, I'm not sure if we should start capturing this in a separate issue, since I'm not sure if what I'm describing makes sense in the…
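To make that concrete, here is a rough sketch of what such a spec could look like. This is a sketch only: the nodeFeatures collector and analyzer, and the when expression, are invented stand-ins for whatever actually comes out of this issue, not existing troubleshoot API.

```yaml
# Purely hypothetical: nodeFeatures and the when expression below are
# invented to illustrate the collector-plus-analyzer split being discussed.
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: gpu-feature-discovery
spec:
  collectors:
    - nodeFeatures: {}            # imagined collector gathering NFD-style labels
                                  # and extended resources from each node
  analyzers:
    - nodeFeatures:               # imagined analyzer over the collected features
        checkName: gpu-present
        outcomes:
          - fail:
              when: 'nodeLabel("feature.node.kubernetes.io/pci-0300_8086.present") != "true"'
              message: No GPU was detected on any node.
          - pass:
              message: At least one node reports a GPU.
```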
Describe the rationale for the suggested feature.
It would be good to be able to support preflights that want to check for GPU scheduling capability. Off-hand, I don't know if this is visible in node metadata, but maybe it could be detected from containerd configuration? This might require a new collector or modifications to the nodeResources collector to detect if a node is capable of scheduling GPUs, and provide capacity/allocation similar to CPU, Memory, and Disk.

Describe the feature

Not sure exactly which fields would be required, or if Allocatable makes sense, but at a minimum something like:

    gpuCapacity - # of GPUs available to a node

so you can write expressions like the sketch below.
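For illustration, an outcome written against that field might look like the following. This is a sketch under the assumption that gpuCapacity were added to the nodeResources analyzer; only the surrounding outcome structure mirrors what the analyzer supports today.

```yaml
# Sketch only: gpuCapacity is the proposed field and does not exist today;
# the surrounding structure mirrors the existing nodeResources analyzer.
analyzers:
  - nodeResources:
      checkName: at-least-one-gpu
      outcomes:
        - fail:
            when: "max(gpuCapacity) < 1"
            message: This application requires at least one node with a schedulable GPU.
        - pass:
            message: A GPU-capable node was found.
```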