Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?) #213
Comments
I'm open to suggestions on what these attributes would look like and how they would be used, but as I mentioned in my comment here #214 (comment), I've struggled to come up with something that would actually be useful. |
Thanks,
How about this? If this driver provides such a knob, users will be able to publish their own extra attributes for their needs. |
For this use case, I found that the presentation exactly matched this case. So, if both NVIDIA/k8s-dra-driver and kubernetes-sigs/cni-dra-driver exposed

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: big-gpu-with-aligned-nic
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: "device.capacity['memory'].compareTo(quantity('80Gi')) >= 0"
    - name: nic
      deviceClassName: rdma.nvidia.com
      selectors:
      - cel:
          expression: "device.attribute['sriovType'] == 'vf'"
    constraints:
    - requestNames: ["gpu", "nic"]
      matchAttribute: k8s.io/pcieRoot

Thus, I would like to know if NVIDIA/k8s-dra-driver plans to expose
Thanks in advance.
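To make the effect of the matchAttribute constraint concrete, here is a minimal sketch of the matching idea in Python. It is not the scheduler's implementation: the device names, the PCIe root values, and the helper match_attribute are all hypothetical and only illustrate that every device picked for the listed requests must carry the attribute with the same value.

```python
from itertools import product

# Hypothetical inventory: device names and PCIe root values are made up.
GPUS = [
    {"name": "gpu-0", "attributes": {"k8s.io/pcieRoot": "pci0000:17"}},
    {"name": "gpu-1", "attributes": {"k8s.io/pcieRoot": "pci0000:85"}},
]
NICS = [
    {"name": "nic-0", "attributes": {"k8s.io/pcieRoot": "pci0000:85"}},
    {"name": "nic-1", "attributes": {"k8s.io/pcieRoot": "pci0000:d0"}},
]


def match_attribute(candidates_per_request, attribute):
    """Return the first combination (one device per request) in which every
    device has `attribute` and all values are identical, or None."""
    for combo in product(*candidates_per_request):
        values = [dev["attributes"].get(attribute) for dev in combo]
        if None not in values and len(set(values)) == 1:
            return combo
    return None


picked = match_attribute([GPUS, NICS], "k8s.io/pcieRoot")
print([dev["name"] for dev in picked] if picked else "no aligned pair")
```

With this made-up inventory, only gpu-1 and nic-0 share a PCIe root, so that pair is the one selected. A device that lacks the attribute can never satisfy the constraint in this sketch.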
Because #214 clearly describes this case, I re-phrased this issue title for isolated discussion. |
Unfortunately, we can't include this until we start to standardize the set of attributes we put under the |
Thanks for the quick reply. OK, then, let me keep this open for now. |
This KEP could be a potential solution, as described. |
That KEP looks interesting. I also think it could be useful for the driver (actually, the base driver framework that we would prefer all drivers to use) to have a hook to allow VM architects to augment the device attributes published by the driver. For example, dropping a file on the node that can tell you which external network each NIC is plumbed to. Patrick's KEP gives the cluster admin an opportunity to enhance attributes. That could be sufficient to do what I am saying. But it may also be helpful to have an on-node way of doing this. |
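As a rough illustration of the on-node hook idea above, the sketch below merges attributes read from a node-local JSON document into the attributes a driver would publish. Everything here is an assumption for illustration: the JSON layout, the example.com/externalNetwork attribute name, and the rule that driver-published values win on conflict are not part of any existing driver or framework.

```python
import json


def augment_attributes(published, extra_json):
    """Merge node-local extra attributes into driver-published ones.

    Assumption: on a key conflict the driver-published value wins, so the
    drop-in file can add attributes but cannot override them.
    """
    extra = json.loads(extra_json)
    merged = {}
    for device, attrs in published.items():
        # Later entries in the dict display win, so driver values take priority.
        merged[device] = {**extra.get(device, {}), **attrs}
    return merged


# Attributes the driver would publish on its own (hypothetical values).
published = {
    "nic-0": {"sriovType": "vf"},
    "nic-1": {"sriovType": "vf"},
}

# Contents of a hypothetical file dropped on the node by the platform owner.
extra_json = json.dumps({
    "nic-0": {"example.com/externalNetwork": "storage"},
    "nic-1": {"example.com/externalNetwork": "tenant", "sriovType": "pf"},
})

merged = augment_attributes(published, extra_json)
print(merged)
```

Letting the driver's own values take priority is only one possible policy. A real hook would also need to decide how to validate attribute names and types.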
Also, I added creating a k8s.io attribute like this as a GA criterion... and I hope to target GA for 1.34. |
Oh, that's also a nice idea. It would be easier and more handy for cluster admins. In this case, should we define what format will be supported (standard YAML/JSON in the
Cool! Thanks! |
I understand DRA will finally be promoted to Beta in v1.32 🎉 Thank you very much to the contributors for your hard work standardizing flexible device scheduling and implementing NVIDIA's dra-driver.
Do you have a plan to expose intra-node topology as device attributes? Especially the distances between GPU<->GPU and GPU<->NIC or HCA (I imagine information equivalent to nvidia-smi topo -m)? Or would you have a plan to provide some extension point for adding user-defined device attributes in this dra-driver?
I imagine the use cases below for optimizing training performance:
Single Node Multi GPUs: a user wants to have 1 pod with 2 GPUs which are connected to each other via NVLink (NV# in nvidia-smi topo -m) → discussed in NVLINK Aware Scheduling #214
PIX in nvidia-smi topo -m) in a specific zone (achieved by node selector)
Thanks in advance.
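For readers unfamiliar with the link codes mentioned above, here is a small sketch of how the two use cases could be expressed as queries over a topology matrix. The matrix is hand-written and hypothetical, in the spirit of nvidia-smi topo -m output, where NV# denotes a connection over a bonded set of # NVLinks and PIX a connection traversing at most a single PCIe bridge. The helper names are made up.

```python
# Hypothetical link matrix; not captured from a real machine.
TOPO = {
    ("GPU0", "GPU1"): "NV4",
    ("GPU0", "GPU2"): "SYS",
    ("GPU0", "NIC0"): "PIX",
    ("GPU1", "NIC0"): "SYS",
    ("GPU2", "NIC0"): "NODE",
}


def is_nvlink(link):
    """True for codes of the form NV<number>, e.g. NV4."""
    return link.startswith("NV") and link[2:].isdigit()


def pairs_with_link(topo, predicate):
    """Return the sorted device pairs whose link code satisfies `predicate`."""
    return sorted(pair for pair, link in topo.items() if predicate(link))


# Use case 1: GPU pairs connected to each other via NVLink.
nvlink_pairs = pairs_with_link(TOPO, is_nvlink)
# Use case 2: GPU/NIC pairs separated by at most a single PCIe bridge.
pix_pairs = pairs_with_link(TOPO, lambda link: link == "PIX")

print(nvlink_pairs, pix_pairs)
```

If a driver exposed comparable information as device attributes, selections like these could in principle be written as claim constraints instead of being computed out of band.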