Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?) #213

Open
everpeace opened this issue Dec 3, 2024 · 9 comments

Comments

@everpeace

everpeace commented Dec 3, 2024

I understand DRA will finally be promoted to Beta in v1.32 🎉 Thank you very much to all the contributors for your hard work standardizing flexible device scheduling and implementing NVIDIA's dra-driver.

Do you have a plan to expose intra-node topology as device attributes? I am especially interested in the distances between GPU<->GPU and GPU<->NIC or HCA (information equivalent to nvidia-smi topo -m). Or would you plan to provide an extension point for adding user-defined device attributes in this dra-driver?

I imagine the following use cases for optimizing training performance:

  • Single Node Multi GPUs:
    • a user wants 1 pod with 2 GPUs that are connected to each other via NVLink (NV# in nvidia-smi topo -m)
      → discussed in NVLINK Aware Scheduling #214
  • Multi Node Multi GPUs:
    • a user wants N pods with 4 GPUs each, where each GPU has an adjacent NIC or HCA
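
To make the kind of information meant here concrete, below is a minimal sketch that turns a matrix in the style of nvidia-smi topo -m into pairwise link types. The sample matrix is made up and simplified (the real output has additional columns such as CPU affinity), and parse_topo and nvlink_pairs are hypothetical helper names, not part of any driver.

```python
# Made-up, simplified matrix in the style of `nvidia-smi topo -m`.
# NV# = NVLink, PIX = at most a single PCIe bridge,
# SYS = across the interconnect between NUMA nodes, X = self.
SAMPLE = """\
      GPU0  GPU1  NIC0
GPU0  X     NV2   PIX
GPU1  NV2   X     SYS
NIC0  PIX   SYS   X
"""


def parse_topo(text):
    """Return {(row_device, col_device): link_type} for every non-self pair."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header = lines[0]
    links = {}
    for row in lines[1:]:
        name, cells = row[0], row[1:]
        for col, link in zip(header, cells):
            if col != name:
                links[(name, col)] = link
    return links


def nvlink_pairs(links):
    """Device pairs connected by NVLink, each pair listed once."""
    pairs = {tuple(sorted(pair)) for pair, link in links.items()
             if link.startswith("NV")}
    return sorted(pairs)


links = parse_topo(SAMPLE)
print(nvlink_pairs(links))        # GPU pairs joined by NVLink
print(links[("GPU0", "NIC0")])    # link type between GPU0 and NIC0
```

Publishing something like the resulting pairs or link types as device attributes is the idea behind both use cases above.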

Thanks, in advance.

@klueska
Collaborator

klueska commented Dec 4, 2024

I'm open to suggestions on what these attributes would look like and how they would be used, but as I mentioned in my comment here #214 (comment), I've struggled to come up with something that would actually be useful.

@everpeace
Author

Thanks,

Or would you plan to provide an extension point for adding user-defined device attributes in this dra-driver?

How about this? If the driver provided such a knob, users could publish their own extra attributes for their needs.

@everpeace
Author

everpeace commented Dec 11, 2024

  • Multi Node Multi GPUs:
    • a user wants N pods with 4 GPUs each, where each GPU has an adjacent NIC or HCA

For this use case, I found a presentation that matches it exactly:

Better Together! GPU, TPU and NIC Topological Alignment with DRA - John Belamaric, Google & Patrick Ohly, Intel

So, if both NVIDIA/k8s-dra-driver and kubernetes-sigs/cni-dra-driver exposed a k8s.io/pcieRoot attribute, a user could define a ResourceClaim like the one below, as described in the session:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: big-gpu-with-aligned-nic
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: "device.capacity['memory'].compareTo(quantity('80Gi')) >= 0"
    - name: nic
      deviceClassName: rdma.nvidia.com
      selectors:
      - cel:
          expression: "device.attribute['sriovType'] == 'vf'"
    constraints:
    - requestNames: ["gpu", "nic"]
      matchAttribute: k8s.io/pcieRoot
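
As a rough illustration of what the matchAttribute constraint above asks for, here is a toy model in Python. It is only a sketch of the semantics (every requested device must carry the attribute with the same value), not the scheduler's allocator; the device names, the pcieRoot values, and the helper names are all made up.

```python
from itertools import product

# Hypothetical device inventory; names and pcieRoot values are made up.
GPUS = [
    {"name": "gpu-0", "attributes": {"k8s.io/pcieRoot": "pci0000:17"}},
    {"name": "gpu-1", "attributes": {"k8s.io/pcieRoot": "pci0000:65"}},
]
NICS = [
    {"name": "nic-0", "attributes": {"k8s.io/pcieRoot": "pci0000:65"}},
    {"name": "nic-1", "attributes": {"k8s.io/pcieRoot": "pci0000:ca"}},
]


def satisfies_match_attribute(devices, attribute):
    """True if every device has the attribute and all values are equal."""
    values = [d["attributes"].get(attribute) for d in devices]
    return None not in values and len(set(values)) == 1


def aligned_pairs(gpus, nics, attribute="k8s.io/pcieRoot"):
    """All (gpu, nic) name pairs that pass the constraint."""
    return [
        (g["name"], n["name"])
        for g, n in product(gpus, nics)
        if satisfies_match_attribute([g, n], attribute)
    ]


print(aligned_pairs(GPUS, NICS))
```

Only the GPU and NIC that report the same PCIe root survive, which is the alignment the claim is meant to express.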

So I would like to know whether NVIDIA/k8s-dra-driver plans to expose the k8s.io/pcieRoot attribute.

Thanks in advance.


  • Single Node Multi GPUs:
    • a user wants 1 pod with 2 GPUs that are connected to each other via NVLink (NV# in nvidia-smi topo -m)

Because #214 already describes this case clearly, I rephrased this issue's title so the two discussions stay separate.

everpeace changed the title from "Exposing Intra-Node Topology (Distance between GPUs, or GPU and NIC, HCA) as device attributes" to "Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?)" on Dec 11, 2024
@klueska
Collaborator

klueska commented Dec 11, 2024

Unfortunately, we can't include this until we start to standardize the set of attributes we put under the k8s.io/* prefix. I could include it under the nvidia.com/* prefix, but that is less useful, because you then can't use it in a matchAttribute constraint to match against an attribute that a different driver publishes under a different prefix.
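
The prefix problem can be sketched in a few lines. This is a simplified model, assuming matchAttribute names a single fully qualified attribute that every device must carry; nic.example.com is a made-up prefix standing in for another vendor's driver.

```python
def match_attribute(devices, qualified_name):
    """True if every device has the named attribute and all values agree."""
    values = [d.get(qualified_name) for d in devices]
    return None not in values and len(set(values)) == 1


# Each driver publishes the same PCIe root, but under its own prefix.
gpu = {"nvidia.com/pcieRoot": "pci0000:65"}
nic = {"nic.example.com/pcieRoot": "pci0000:65"}  # hypothetical other driver
vendor_prefixed = match_attribute([gpu, nic], "nvidia.com/pcieRoot")

# With one shared, standardized name, the two devices become comparable.
gpu_std = {"k8s.io/pcieRoot": "pci0000:65"}
nic_std = {"k8s.io/pcieRoot": "pci0000:65"}
standardized = match_attribute([gpu_std, nic_std], "k8s.io/pcieRoot")

print(vendor_prefixed, standardized)
```

The hardware is identical in both cases; only the shared name lets the constraint see that.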

@everpeace
Author

Thanks for the quick reply. OK, then, let me keep this open for now.

@everpeace
Author

everpeace commented Jan 14, 2025

This KEP could be a potential solution:
issue: kubernetes/enhancements#5027
PR: kubernetes/enhancements#5034

@johnbelamaric

That KEP looks interesting. I also think it could be useful for the driver (actually, the base driver framework that we would prefer all drivers to use) to have a hook to allow VM architects to augment the device attributes published by the driver.

For example, dropping a file on the node that can tell you which external network each NIC is plumbed to.

Patrick's KEP gives the cluster admin an opportunity to enhance attributes. That could be sufficient to do what I am saying. But it may also be helpful to have an on-node way of doing this.
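
A minimal sketch of what such an on-node hook might do, assuming a JSON file keyed by device name. The file format, the example.com/externalNetwork attribute, and augment_attributes are all hypothetical; nothing like this exists in the driver today.

```python
import json


def augment_attributes(published, extra_json):
    """Merge attributes from a node-local file into driver-published ones.

    Driver-published values win on conflict; that is a choice made for this
    sketch, not settled behavior.
    """
    extra = json.loads(extra_json)
    merged = {}
    for device, attrs in published.items():
        merged[device] = {**extra.get(device, {}), **attrs}
    return merged


# Attributes the driver discovered on its own.
published = {"nic-0": {"sriovType": "vf"}, "nic-1": {"sriovType": "vf"}}

# Contents of a hypothetical file dropped on the node.
node_file = """
{
  "nic-0": {"example.com/externalNetwork": "storage"},
  "nic-1": {"example.com/externalNetwork": "tenant", "sriovType": "pf"}
}
"""

result = augment_attributes(published, node_file)
print(result["nic-0"])
print(result["nic-1"])
```

Whether the file or the driver should win on a conflict is exactly the kind of detail a real hook would have to define.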

@johnbelamaric

Also, I added creating a k8s.io attribute like this as a GA criterion, and I hope to target GA for 1.34.

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#ga

@everpeace
Author

For example, dropping a file on the node that can tell you which external network each NIC is plumbed to.

Oh, that's also a nice idea. It would be easier and handier for cluster admins. In that case, should we define which format will be supported (for example, standard YAML/JSON as in the ResourceSliceOverride object)?

Also, I added creating a k8s.io attribute like this as a GA criterion, and I hope to target GA for 1.34.

Cool! Thanks!
