Update to work with kubernetes 1.32 #218

klueska · 2024-12-11T13:08:47Z

No description provided.

guptaNswati · 2024-12-11T20:28:26Z

cmd/nvidia-dra-plugin/driver.go

@@ -48,7 +50,7 @@ func NewDriver(ctx context.Context, config *Config) (*driver, error) {

 	plugin, err := kubeletplugin.Start(
 		ctx,
-		driver,
+		[]any{driver},


@klueska why is this changed to a slice?

klueska · 2024-12-11T20:31:26Z

That's how Patrick changed the API to be able to support both v1alpha4 drivers and v1beta1 drivers simultaneously (if desired).
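For readers unfamiliar with the pattern, here is a minimal sketch of how a start function can accept a slice of opaque servers and detect by type assertion which API versions each one implements. Everything in it is hypothetical: `alphaServer`, `betaServer`, `start`, and the two driver types are illustrative stand-ins, not the actual `kubeletplugin` API or the generated DRA gRPC interfaces.

```go
package main

import "fmt"

// alphaServer and betaServer are hypothetical stand-ins for two server
// interfaces generated from different API versions.
type alphaServer interface {
	PrepareAlpha(claim string) string
}

type betaServer interface {
	PrepareBeta(claim string) string
}

// start is a hypothetical constructor. It accepts opaque servers and uses
// type assertions to discover which API versions each one can serve.
// A single value may satisfy both interfaces and is then registered twice.
func start(servers []any) ([]string, error) {
	var registered []string
	for i, s := range servers {
		matched := false
		if _, ok := s.(alphaServer); ok {
			registered = append(registered, fmt.Sprintf("server %d: alpha", i))
			matched = true
		}
		if _, ok := s.(betaServer); ok {
			registered = append(registered, fmt.Sprintf("server %d: beta", i))
			matched = true
		}
		if !matched {
			return nil, fmt.Errorf("server %d: unsupported type %T", i, s)
		}
	}
	return registered, nil
}

// betaOnlyDriver implements only the newer interface.
type betaOnlyDriver struct{}

func (betaOnlyDriver) PrepareBeta(claim string) string { return "beta:" + claim }

// dualDriver implements both interfaces, so one value serves both versions.
type dualDriver struct{}

func (dualDriver) PrepareAlpha(claim string) string { return "alpha:" + claim }
func (dualDriver) PrepareBeta(claim string) string  { return "beta:" + claim }

func main() {
	regs, err := start([]any{betaOnlyDriver{}, dualDriver{}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range regs {
		fmt.Println(r)
	}
}
```

Under this assumption, a driver that implements only one version is still passed as a one-element slice, which matches the `[]any{driver}` form in the diff above.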

Signed-off-by: Kevin Klues <[email protected]>

zhouhao3 · 2024-12-16T08:13:32Z

Dynamic MIG is no longer possible in 1.31, but should be possible again in 1.32.

@klueska Hi, I noticed the statement quoted above, which you made earlier.
Is Dynamic MIG supported after this PR?
I tried it with the latest code and kubectl v1.32.0, and it does not seem to work, so I want to confirm whether it is still unsupported or whether there is something wrong with my usage.

zhouhao3 · 2025-01-02T07:09:23Z

@klueska Hi, I noticed your earlier statement that Dynamic MIG should be possible again in 1.32.
Is Dynamic MIG supported after this PR?
I tried it with the latest code and kubectl v1.32.0, and it does not seem to work, so I want to confirm whether it is still unsupported or whether there is something wrong with my usage.

Detailed environment:

$ k version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0

$ docker exec -it ae10cc5e124b bash
root@k8s-dra-driver-cluster-worker:/# nvidia-smi 
Thu Jan  2 15:14:39 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:1F:00.0 Off |                   On |
| N/A   32C    P0              42W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I modified gpu-test4.yaml to use a MIG configuration supported by the GPU:

$ k apply -f demo/specs/quickstart/gpu-test4.yaml
namespace/gpu-test4 created
resourceclaimtemplate.resource.k8s.io/mig-devices created
deployment.apps/pod created

$ k get po -n gpu-test4
NAME                   READY   STATUS    RESTARTS   AGE
pod-59b9fd959c-4pkgs   0/4     Pending   0          32s
pod-59b9fd959c-87lxj   0/4     Pending   0          32s
pod-59b9fd959c-hkhh2   0/4     Pending   0          32s
pod-59b9fd959c-tn68c   0/4     Pending   0          32s


$ k describe po pod-59b9fd959c-4pkgs -n gpu-test4
......
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  62s   default-scheduler  0/2 nodes are available: 1 cannot allocate all claims, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. still not schedulable, preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

$ k get resourceSlice k8s-dra-driver-cluster-worker-gpu.nvidia.com-286hm  -o yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  creationTimestamp: "2025-01-02T11:14:14Z"
  generateName: k8s-dra-driver-cluster-worker-gpu.nvidia.com-
  generation: 1
  name: k8s-dra-driver-cluster-worker-gpu.nvidia.com-286hm
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: k8s-dra-driver-cluster-worker
    uid: 76034755-b047-4377-80e5-469d83a5ca8e
  resourceVersion: "748"
  uid: a153b198-a87a-4b38-bbfb-623472f24aba
spec:
  devices: null
  driver: gpu.nvidia.com
  nodeName: k8s-dra-driver-cluster-worker
  pool:
    generation: 1
    name: k8s-dra-driver-cluster-worker
    resourceSliceCount: 1

$ k get resourceClaim -A
NAMESPACE   NAME                                     STATE     AGE               
gpu-test4   pod-59b9fd959c-4pkgs-mig-devices-tc89s   pending   6m12s
gpu-test4   pod-59b9fd959c-87lxj-mig-devices-4g4tq   pending   6m12s
gpu-test4   pod-59b9fd959c-hkhh2-mig-devices-2tlwh   pending   6m12s
gpu-test4   pod-59b9fd959c-tn68c-mig-devices-tfzcw   pending   6m12s

    
$ nvidia-smi
Thu Jan  2 15:19:54 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:1F:00.0 Off |                   On |
| N/A   32C    P0              42W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I expected MIG to be configured automatically, but it was not.

klueska · 2025-01-07T20:03:40Z

It is not. Once this KEP is implemented it will be supported again:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4815-dra-partitionable-devices/README.md

Unfortunately, we ran out of time in 1.32 to build the implementation. It should be done for 1.33 (as an alpha feature).

klueska force-pushed the update-1.32 branch 2 times, most recently from be656d3 to bbc913d Compare December 11, 2024 13:38

guptaNswati reviewed Dec 11, 2024

klueska added 2 commits December 12, 2024 09:56

Update vendoring for update to kubernetes v1.32.0

ec7c59f

Signed-off-by: Kevin Klues <[email protected]>

Update deployment for Kubernetes 1.32.0

1258b53

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the update-1.32 branch from bbc913d to 1258b53 Compare December 12, 2024 09:57

klueska merged commit 6c34f5f into NVIDIA:main Dec 12, 2024
6 checks passed
