I tested this successfully with cgroup v1, but it fails on cgroup v2.
Here are the logs:
2025-01-13T03:41:56.696Z INFO GPUMounter-worker/main.go:15 Service Starting...
2025-01-13T03:41:56.696Z INFO gpu-mount/server.go:22 Creating gpu mounter
2025-01-13T03:41:56.696Z INFO allocator/allocator.go:28 Creating gpu allocator
2025-01-13T03:41:56.696Z INFO collector/collector.go:24 Creating gpu collector
2025-01-13T03:41:56.696Z INFO collector/collector.go:42 Start get gpu info
2025-01-13T03:41:56.704Z INFO collector/collector.go:53 GPU Num: 1
2025-01-13T03:41:56.710Z INFO collector/collector.go:91 Updating GPU status
2025-01-13T03:41:56.711Z INFO collector/collector.go:136 GPU status update successfully
2025-01-13T03:41:56.711Z INFO collector/collector.go:36 Successfully update gpu status
2025-01-13T03:41:56.711Z INFO allocator/allocator.go:35 Successfully created gpu collector
2025-01-13T03:41:56.711Z INFO gpu-mount/server.go:29 Successfully created gpu allocator
2025-01-13T03:41:56.711Z INFO GPUMounter-worker/main.go:22 Successfully created gpu mounter
2025-01-13T03:41:58.732Z INFO gpu-mount/server.go:35 AddGPU Service Called
2025-01-13T03:41:58.732Z INFO gpu-mount/server.go:36 request: pod_name:"owner-pod" namespace:"default" gpu_num:1
2025-01-13T03:41:58.750Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2025-01-13T03:41:58.750Z INFO allocator/allocator.go:159 Get pod default/owner-pod mount type
2025-01-13T03:41:58.750Z INFO collector/collector.go:91 Updating GPU status
2025-01-13T03:41:58.750Z INFO collector/collector.go:136 GPU status update successfully
2025-01-13T03:41:58.758Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: owner-pod-slave-pod-40a529 for Owner Pod: owner-pod
2025-01-13T03:41:58.758Z INFO allocator/allocator.go:239 Checking Pods: owner-pod-slave-pod-40a529 state
2025-01-13T03:41:58.760Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:41:58.762Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:41:58.763Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:41:58.765Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:42:00.142Z INFO allocator/allocator.go:278 Pods: owner-pod-slave-pod-40a529 are running
2025-01-13T03:42:00.142Z INFO allocator/allocator.go:84 Successfully create Slave Pod: owner-pod-slave-pod-40a529, for Owner Pod: owner-pod
2025-01-13T03:42:00.142Z INFO collector/collector.go:91 Updating GPU status
2025-01-13T03:42:00.143Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: owner-pod-slave-pod-40a529 in Namespace default
2025-01-13T03:42:00.143Z INFO collector/collector.go:136 GPU status update successfully
2025-01-13T03:42:00.143Z INFO gpu-mount/server.go:81 Start mounting, Total: 1 Current: 1
2025-01-13T03:42:00.143Z INFO util/util.go:19 Start mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-cf2f070a-ff5a-ee2b-16ac-047f7e9c16bb","State":"GPU_ALLOCATED_STATE","PodName":"owner-pod-slave-pod-40a529","Namespace":"default"} to Pod: owner-pod
2025-01-13T03:42:00.143Z INFO util/util.go:24 Pod :owner-pod container ID: a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740
2025-01-13T03:42:00.143Z INFO util/util.go:35 Successfully get cgroup path: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4fea5b3d_b5ff_4e7f_a0f6_31a9ba061196.slice/docker-a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740.scope for Pod: owner-pod
2025-01-13T03:42:00.145Z ERROR cgroup/cgroup.go:148 Exec "echo 'c 195:0 rw' > /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4fea5b3d_b5ff_4e7f_a0f6_31a9ba061196.slice/docker-a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740.scope/devices.allow" failed
2025-01-13T03:42:00.145Z ERROR cgroup/cgroup.go:149 Output: sh: 1: cannot create /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4fea5b3d_b5ff_4e7f_a0f6_31a9ba061196.slice/docker-a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740.scope/devices.allow: Directory nonexistent
2025-01-13T03:42:00.145Z ERROR cgroup/cgroup.go:150 exit status 2
2025-01-13T03:42:00.145Z ERROR util/util.go:38 Add GPU {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-cf2f070a-ff5a-ee2b-16ac-047f7e9c16bb","State":"GPU_ALLOCATED_STATE","PodName":"owner-pod-slave-pod-40a529","Namespace":"default"}failed
2025-01-13T03:42:00.145Z ERROR gpu-mount/server.go:84 Mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-cf2f070a-ff5a-ee2b-16ac-047f7e9c16bb","State":"GPU_ALLOCATED_STATE","PodName":"owner-pod-slave-pod-40a529","Namespace":"default"} to Pod: owner-pod in Namespace: default failed
2025-01-13T03:42:00.145Z ERROR gpu-mount/server.go:85 exit status 2
I checked the filesystem on the node: /sys/fs/cgroup/devices does not exist.
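For context: as far as I understand, cgroup v2 removes the devices controller's devices.allow/devices.deny files entirely, and device access is instead enforced by eBPF programs (BPF_CGROUP_DEVICE) attached to the cgroup. So the `echo 'c 195:0 rw' > .../devices.allow` write in cgroup/cgroup.go can never succeed on a v2 node. Below is a minimal sketch (not part of GPUMounter; `isCgroupV2` is a name I made up) of how the worker could detect a unified-hierarchy node before attempting the v1 write path, using golang.org/x/sys/unix:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// isCgroupV2 reports whether /sys/fs/cgroup is the unified cgroup v2
// hierarchy. On such nodes /sys/fs/cgroup/devices does not exist, so
// writing to devices.allow cannot work.
func isCgroupV2() (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false, err
	}
	return st.Type == unix.CGROUP2_SUPER_MAGIC, nil
}

func main() {
	v2, err := isCgroupV2()
	if err != nil {
		fmt.Fprintln(os.Stderr, "statfs /sys/fs/cgroup failed:", err)
		os.Exit(1)
	}
	fmt.Println("cgroup v2:", v2)
}
```

On a cgroup v2 node this returns true, which would at least let the worker fail fast with a clear message (or branch to an eBPF-based device-permission path) instead of surfacing the shell's "Directory nonexistent" / exit status 2.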