Update to work with kubernetes 1.32 #218

Merged: 2 commits, Dec 12, 2024

@klueska klueska commented Dec 11, 2024

No description provided.

@klueska klueska force-pushed the update-1.32 branch 2 times, most recently from be656d3 to bbc913d Compare December 11, 2024 13:38
@@ -48,7 +50,7 @@ func NewDriver(ctx context.Context, config *Config) (*driver, error) {

 	plugin, err := kubeletplugin.Start(
 		ctx,
-		driver,
+		[]any{driver},
Contributor

@klueska why is this changed to a slice?

@klueska
Collaborator Author

klueska commented Dec 11, 2024

That's how Patrick changed the API so that a plugin can support both v1alpha4 drivers and v1beta1 drivers simultaneously (if desired).
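
To illustrate why a slice of `any` helps here, the following is a minimal sketch, not the actual `kubeletplugin` implementation. It assumes the callee inspects each element with type assertions and registers it for every API version it implements. The interfaces `v1alpha4Server` and `v1beta1Server`, their methods, and the helper `registeredVersions` are hypothetical names invented for this example; the real gRPC service interfaces have different method sets.

```go
package main

import "fmt"

// Hypothetical stand-ins for the two versioned service interfaces.
// The real interfaces in the kubelet DRA API have different methods.
type v1alpha4Server interface {
	PrepareV1Alpha4(claim string) string
}

type v1beta1Server interface {
	PrepareV1Beta1(claim string) string
}

// driver implements both hypothetical interfaces, so a single value
// can serve both API versions.
type driver struct{}

func (driver) PrepareV1Alpha4(claim string) string { return "v1alpha4:" + claim }
func (driver) PrepareV1Beta1(claim string) string  { return "v1beta1:" + claim }

// registeredVersions checks each element of servers with type assertions
// and reports every API version that element could be registered for.
// Elements that implement neither interface contribute nothing.
func registeredVersions(servers []any) []string {
	var versions []string
	for _, s := range servers {
		if _, ok := s.(v1alpha4Server); ok {
			versions = append(versions, "v1alpha4")
		}
		if _, ok := s.(v1beta1Server); ok {
			versions = append(versions, "v1beta1")
		}
	}
	return versions
}

func main() {
	// Passing the driver in a slice mirrors the shape of the call in the diff.
	fmt.Println(registeredVersions([]any{driver{}})) // prints "[v1alpha4 v1beta1]"
}
```

Under this assumption, accepting `[]any` instead of a single typed argument lets one call pass either one object that implements both versions or separate objects per version, without the function signature naming any specific version.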

@klueska klueska merged commit 6c34f5f into NVIDIA:main Dec 12, 2024
6 checks passed
@zhouhao3

zhouhao3 commented Jan 2, 2025

"Dynamic MIG is no longer possible in 1.31, but should be possible again in 1.32."

@klueska Hi, I noticed that you mentioned the above earlier.
Is Dynamic MIG supported after this PR?
I tried it with the latest code and kubectl v1.32.0 and it does not seem to work, so I would like to confirm whether it is still unsupported or whether something is wrong with my usage.

Environment details:

$ k version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0

$ docker exec -it ae10cc5e124b bash
root@k8s-dra-driver-cluster-worker:/# nvidia-smi 
Thu Jan  2 15:14:39 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:1F:00.0 Off |                   On |
| N/A   32C    P0              42W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I modified gpu-test4.yaml to use a MIG configuration supported by this GPU:

$ k apply -f demo/specs/quickstart/gpu-test4.yaml
namespace/gpu-test4 created
resourceclaimtemplate.resource.k8s.io/mig-devices created
deployment.apps/pod created

$ k get po -n gpu-test4
NAME                   READY   STATUS    RESTARTS   AGE
pod-59b9fd959c-4pkgs   0/4     Pending   0          32s
pod-59b9fd959c-87lxj   0/4     Pending   0          32s
pod-59b9fd959c-hkhh2   0/4     Pending   0          32s
pod-59b9fd959c-tn68c   0/4     Pending   0          32s


$ k describe po pod-59b9fd959c-4pkgs -n gpu-test4
......
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  62s   default-scheduler  0/2 nodes are available: 1 cannot allocate all claims, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. still not schedulable, preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

$ k get resourceSlice k8s-dra-driver-cluster-worker-gpu.nvidia.com-286hm  -o yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  creationTimestamp: "2025-01-02T11:14:14Z"
  generateName: k8s-dra-driver-cluster-worker-gpu.nvidia.com-
  generation: 1
  name: k8s-dra-driver-cluster-worker-gpu.nvidia.com-286hm
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: k8s-dra-driver-cluster-worker
    uid: 76034755-b047-4377-80e5-469d83a5ca8e
  resourceVersion: "748"
  uid: a153b198-a87a-4b38-bbfb-623472f24aba
spec:
  devices: null
  driver: gpu.nvidia.com
  nodeName: k8s-dra-driver-cluster-worker
  pool:
    generation: 1
    name: k8s-dra-driver-cluster-worker
    resourceSliceCount: 1

$ k get resourceClaim -A
NAMESPACE   NAME                                     STATE     AGE               
gpu-test4   pod-59b9fd959c-4pkgs-mig-devices-tc89s   pending   6m12s
gpu-test4   pod-59b9fd959c-87lxj-mig-devices-4g4tq   pending   6m12s
gpu-test4   pod-59b9fd959c-hkhh2-mig-devices-2tlwh   pending   6m12s
gpu-test4   pod-59b9fd959c-tn68c-mig-devices-tfzcw   pending   6m12s

    
$ nvidia-smi
Thu Jan  2 15:19:54 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:1F:00.0 Off |                   On |
| N/A   32C    P0              42W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+


I expected MIG to be configured automatically, but it was not.

@klueska
Collaborator Author

klueska commented Jan 7, 2025

No, it is not supported yet. It will be supported again once this KEP is implemented:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4815-dra-partitionable-devices/README.md

Unfortunately, we ran out of time in 1.32 to build the implementation. It should be done for 1.33 (as an alpha feature).
