Support for fabric-attached GPUs #2
Comments
I want to share some concerns about supporting fabric-attached GPUs, and I really hope that we can create a driver that supports them. Before DRA, using NVIDIA GPUs in Kubernetes relied on the device plugin and the related NVIDIA GPU Operator, so we tested whether they can support dynamically attaching and detaching GPUs. The result is no, because of the following problems:
The above problems come from the limitations of the device plugin API. Since DRA can do much more than the device plugin, including using fabric-attached devices, I really hope that we can find a feasible way to create a DRA driver that supports fabric-attached GPUs and resolves the above problems.
Support for fabric-attached GPUs is coming soon. It will be enabled through a technology known as IMEX, which allows the GPU on one node to securely read/write GPU memory on other nodes within the same "IMEX domain". From an end user's perspective, a resource claim will be used to request access to a shared IMEX channel within a given IMEX domain, and all of the pods running on nodes that want to use this channel to read/write each other's GPU memory will need to reference this claim.
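To make the claim-sharing rule above concrete, here is a toy model in Python. It is only a sketch of the relationship described in this comment, not the driver's API: the names `ImexDomain`, `ChannelClaim`, `Pod`, and `can_share_gpu_memory` are all hypothetical, and the real resource claim objects may look quite different.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ImexDomain:
    """Hypothetical: a named set of nodes that belong to one IMEX domain."""
    name: str
    nodes: FrozenSet[str]


@dataclass(frozen=True)
class ChannelClaim:
    """Hypothetical: a claim for a shared IMEX channel inside one domain."""
    name: str
    domain: ImexDomain


@dataclass(frozen=True)
class Pod:
    """Hypothetical: a pod pinned to a node, optionally referencing a claim."""
    name: str
    node: str
    claim: Optional[ChannelClaim] = None


def can_share_gpu_memory(a: Pod, b: Pod) -> bool:
    """True only if both pods reference the same claim and both run on
    nodes that are members of that claim's IMEX domain."""
    if a.claim is None or a.claim != b.claim:
        return False
    nodes = a.claim.domain.nodes
    return a.node in nodes and b.node in nodes
```

In this model, two pods that each create their own claim, or a pod scheduled onto a node outside the domain, would not be able to share GPU memory with the others.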
In the KEP, the use cases and goals of DRA include using devices that are dynamically attached from the fabric. Does NVIDIA have any plan to support fabric-attached GPUs in this DRA driver?
Allocating fabric-attached resources involves both infrastructure-provider work (configuring the fabric) and device-vendor-specific work (setting up the environment, configuring devices, and so on), so I am unsure who should implement the related driver and how.
Should the infrastructure provider create a custom driver that does both kinds of work, or should a device vendor create a driver that handles the device-specific work and enable it to talk to a remote fabric manager to request fabric-attached devices? For example, this DRA driver could include a gRPC client that asks the remote server to filter out nodes that are unsuitable for attaching GPUs, to attach GPUs to a specific node, and to detach GPUs from a specific node, so that the driver can use both local and fabric-attached GPUs.
Maybe a common API between a DRA driver and a fabric controller component should be discussed.
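As a starting point for that discussion, the three operations mentioned above (filter nodes, attach, detach) could be sketched roughly as follows. This is written in Python only to keep it short; every name here (`FabricController`, `InMemoryFabric`, `pick_node_and_attach`) is hypothetical, and the in-memory class merely stands in for a remote fabric manager that a real driver would presumably reach over gRPC.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Set, Tuple


class FabricController(Protocol):
    """Hypothetical contract between a DRA driver and a fabric manager."""

    def filter_nodes(self, candidates: List[str], gpus: int) -> List[str]: ...
    def attach(self, node: str, gpus: int) -> List[str]: ...
    def detach(self, node: str, gpu_ids: List[str]) -> None: ...


@dataclass
class InMemoryFabric:
    """Toy stand-in for a remote fabric manager (no network calls)."""
    free_gpus: List[str]
    reachable: Set[str]  # nodes that are connected to the fabric
    attached: Dict[str, List[str]] = field(default_factory=dict)

    def filter_nodes(self, candidates: List[str], gpus: int) -> List[str]:
        # No node is suitable if the pool cannot satisfy the request.
        if gpus > len(self.free_gpus):
            return []
        return [n for n in candidates if n in self.reachable]

    def attach(self, node: str, gpus: int) -> List[str]:
        if node not in self.reachable or gpus > len(self.free_gpus):
            raise RuntimeError(f"cannot attach {gpus} GPU(s) to {node}")
        taken, self.free_gpus = self.free_gpus[:gpus], self.free_gpus[gpus:]
        self.attached.setdefault(node, []).extend(taken)
        return taken

    def detach(self, node: str, gpu_ids: List[str]) -> None:
        held = self.attached.get(node, [])
        for gpu in gpu_ids:
            held.remove(gpu)  # raises ValueError if not attached to node
            self.free_gpus.append(gpu)


def pick_node_and_attach(
    fabric: FabricController, candidates: List[str], gpus: int
) -> Tuple[str, List[str]]:
    """Driver-side flow: ask the fabric which nodes qualify, then attach."""
    suitable = fabric.filter_nodes(candidates, gpus)
    if not suitable:
        raise RuntimeError("no candidate node can receive fabric GPUs")
    node = suitable[0]
    return node, fabric.attach(node, gpus)
```

The point of separating the `FabricController` contract from its implementation is that the device vendor's driver would depend only on the contract, while each infrastructure provider could supply its own fabric manager behind it.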
I would like to know NVIDIA's thoughts on these questions.